MAP model selection in Gaussian regression

We consider a Bayesian approach to model selection in Gaussian linear regression, where the number of predictors may be much larger than the number of observations. From a frequentist viewpoint, the proposed procedure amounts to penalized least squares estimation with a complexity penalty associated with a prior on the model size. We investigate the optimality properties of the resulting estimator, establish an oracle inequality and specify conditions on the prior that imply its asymptotic minimaxity within a wide range of sparse and dense settings for "nearly-orthogonal" and "multicollinear" designs.


Introduction
Consider the standard Gaussian linear regression model

    y = Xβ + ε,    (1)

where y ∈ R^n is the vector of observations of the response variable Y, X_{n×p} is the design matrix of the p explanatory variables (predictors) X_1, ..., X_p, β ∈ R^p is a vector of unknown regression coefficients, ε ∼ N(0, σ²I_n), and the noise variance σ² is assumed to be known.
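For concreteness, the model can be simulated in a few lines (a minimal sketch; the dimensions, design and coefficient values below are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 20, 1.0            # illustrative dimensions and known noise level
X = rng.standard_normal((n, p))      # design matrix X (n x p)
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # a sparse coefficient vector
y = X @ beta + sigma * rng.standard_normal(n)   # y = X beta + eps, eps ~ N(0, sigma^2 I_n)
```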
A variety of statistical applications of regression models involve a vast number of potential explanatory variables, which may be large even relative to the amount of available data. This raises a severe "curse of dimensionality" problem, and reducing the dimensionality of the model therefore becomes crucial in the analysis of such large data sets. The goal of model (or variable) selection is to select the "best", parsimonious subset of predictors; the corresponding coefficients are then usually estimated by least squares. The meaning of the "best" subset, however, depends on the particular aim at hand. One should distinguish, for example, between estimation of the regression coefficients β and estimation of the mean vector Xβ, whose quadratic risk is the sum of a bias term and a variance term proportional to the model size. The "best" model is then the one with the minimal quadratic risk. Note that the true underlying model in (1) is not necessarily the best in this sense, since it may be possible to reduce the risk by excluding predictors with small (but still nonzero!) coefficients.
Such a criterion for model selection is obviously impossible to implement, since it depends on the unknown β. Instead, the corresponding ideal minimal risk can be used as a benchmark for any available model selection procedure. Model selection criteria are typically based on the empirical quadratic risk ||y − Xβ̂_M||², which is essentially least squares. However, direct minimization of the empirical risk evidently leads to the trivial (and unsatisfactory!) choice of the saturated model. A typical remedy is to add a complexity penalty Pen(|M|) that increases with the model size, and to consider a penalized least squares criterion of the form

    M̂ = arg min_M { ||y − Xβ̂_M||² + Pen(|M|) }.    (2)

The properties of the resulting estimator depend on the proper choice of the complexity penalty function Pen(·) in (2), and there is a plethora of works in the literature on this problem. The standard, most commonly used choice is a linear penalty of the form Pen(k) = 2σ²λk for some fixed λ > 0. The best-known examples, motivated by different ideas, include AIC with λ = 1 (Akaike, 1974), BIC with λ = (ln n)/2 (Schwarz, 1978) and RIC with λ = ln p (Foster & George, 1994). A series of recent works suggested so-called 2k ln(p/k)-type nonlinear penalties of the form Pen(k) = 2σ²ck(ln(p/k) + ζ_{p,k}), where c > 1 and ζ_{p,k} is some "negligible" term (see, e.g., Birgé & Massart, 2001).

Our contribution is twofold. First, under mild conditions on the prior we establish an oracle inequality and show that, up to a constant multiplier, the resulting MAP model selectors achieve the minimal possible risk among all estimators. We then investigate their asymptotic minimaxity. For "nearly-orthogonal" designs they are proved to be simultaneously rate-optimal (in the minimax sense) over a wide range of sparse and dense settings and to outperform various existing model selection procedures, e.g., AIC, BIC, RIC, the Lasso (Tibshirani, 1996) and the Dantzig selector (Candès & Tao, 2007).
In a way, these results extend those of Abramovich, Grinshtein & Pensky (2007).

The analysis of "multicollinear" designs, which is especially relevant in the "p much larger than n" setup, is more delicate. We demonstrate that the lower bounds on the minimax rates for estimating the mean vector are in this case smaller than those for "nearly-orthogonal" designs by a factor depending on the design properties. Such a "blessing of multicollinearity" can be explained by the possibility of exploiting correlations between predictors to reduce the size of a model (hence, to decrease the variance) without paying much extra price in the bias term. We show that under some additional assumptions on the design and the coefficient vector β in (1), the proposed Bayesian model selectors are still asymptotically rate-optimal.
The paper is organized as follows. The Bayesian model selection procedure that leads to a penalized least squares estimator (2) is introduced in Section 2. In Section 3 we derive an upper bound for the quadratic risk of the resulting MAP model selector, compare it with that of an oracle and find conditions on the prior under which, up to a constant multiplier, it achieves the minimal possible risk among all estimators. In Section 4 we obtain upper and lower risk bounds for the MAP model selector in a sparse setup, which allows us to investigate its asymptotic minimaxity for nearly-orthogonal and multicollinear designs in Section 5. Computational aspects are discussed in Section 6, and the main take-away messages of the paper are summarized in Section 7. All proofs are given in the Appendix.


MAP model selection procedure

For any model M, the corresponding least squares estimate of its coefficients is β̂_M = (X_M^T X_M)^+ X_M^T y, where "+" denotes the generalized inverse matrix.
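A brute-force sketch ties these pieces together: for each candidate model, compute the least squares fit through the generalized inverse and evaluate the penalized criterion (2). The function name and the example linear penalties (the AIC, BIC and RIC choices of λ quoted in the Introduction) are our own illustrative conventions:

```python
import itertools
import numpy as np

def penalized_lsq_select(X, y, pen, max_size=None):
    """Minimize ||y - X_M beta_M||^2 + pen(|M|) over all subsets M by brute force."""
    n, p = X.shape
    max_size = p if max_size is None else max_size
    best = (np.inf, ())
    for k in range(max_size + 1):
        for M in itertools.combinations(range(p), k):
            XM = X[:, M]
            # least squares fit on the submodel via the generalized inverse
            resid = y - XM @ (np.linalg.pinv(XM) @ y) if k else y
            score = resid @ resid + pen(k)
            if score < best[0]:
                best = (score, M)
    return best[1]

# linear penalties Pen(k) = 2 sigma^2 lambda k with the classical lambda choices
sigma = 1.0
aic = lambda k, n=50, p=10: 2 * sigma**2 * 1.0 * k              # lambda = 1 (AIC)
bic = lambda k, n=50, p=10: 2 * sigma**2 * (np.log(n) / 2) * k  # lambda = (ln n)/2 (BIC)
ric = lambda k, n=50, p=10: 2 * sigma**2 * np.log(p) * k        # lambda = ln p (RIC)
```

With a zero penalty the procedure returns the saturated model, illustrating why a complexity penalty is needed in the first place.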
For any k = 0, ..., r − 1 there are (p choose k) different models of a given size k. Assume all of them to be equally likely, that is, conditionally on |M| = k,

    P(M | |M| = k) = (p choose k)^{−1},

and let π(k) = P(|M| = k) be a prior on the model size. One should be a little more careful for k = r = rank(X). Although there are (p choose r) different sets of predictors of size r, all of them evidently result in the same estimator of the mean vector and, in this sense, are essentially indistinguishable and associated with a single (saturated) model. Hence, in this case we set P(M | |M| = r) = 1. Finally, assume the normal prior on the unknown vector of k coefficients,

    β_M | M ∼ N_k(0, γσ²(X_M^T X_M)^{−1}),    γ > 0.

This is the well-known conventional g-prior of Zellner (1986).
For the proposed hierarchical prior, straightforward calculus yields the posterior probability of a model M of size |M| = k = 0, ..., r − 1:

    P(M | y) ∝ π(k) (p choose k)^{−1} (1 + γ)^{−k/2} exp{ γ ||Xβ̂_M||² / (2σ²(1 + γ)) }.

Finding the most likely model therefore leads to the following maximum a posteriori (MAP) model selection criterion: minimize

    ||y − Xβ̂_M||² + Pen(|M|),

which is of the general type (2) with the complexity penalty

    Pen(k) = 2σ²(1 + 1/γ){ ln (p choose k) − ln π(k) + (k/2) ln(1 + γ) },  k = 0, ..., r − 1.    (6)

Similarly, for |M| = r one has

    Pen(r) = 2σ²(1 + 1/γ){ −ln π(r) + (r/2) ln(1 + γ) }.    (7)

The specific form of the penalty (6)-(7) depends on the choice of the prior π(·). In particular, the (truncated if p > n) binomial prior B(p, ξ) corresponds to the prior assumption that the indicators d_{jM} of including the predictors in the model are independent. The binomial prior yields the linear penalty Pen(k) = 2σ²λk (up to an additive constant), where λ = (1 + 1/γ)(ln((1 − ξ)/ξ) + ln(1 + γ)/2). In particular, a fixed ξ yields a fixed λ (an AIC-type penalty), while ξ = 1/p gives λ ∼ ln p and leads to the RIC criterion. These relations indicate that RIC should be appropriate for sparse cases, where the size of the true (unknown) model is believed to be much smaller than the number of possible predictors, while AIC is suitable for dense cases, where they are of the same order. In fact, no binomial prior or, equivalently, no linear penalty can "kill two birds with one stone". On the other hand, the (truncated) geometric prior π(k) ∝ q^k, k = 1, ..., r, for some 0 < q < 1, implies Pen(k) ∼ 2σ²(1 + 1/γ)k(ln(p/k) + c(γ, q)), which is of the 2k ln(p/k)-type introduced above. For large γ it behaves similarly to RIC for k ≪ p and to AIC for k ∼ p and is, therefore, adaptive to both sparse and dense cases.
We will discuss these issues more rigorously in Section 5 below.
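The behavior of the prior-induced penalties can be compared numerically. The sketch below assumes the penalty form Pen(k) = 2σ²(1 + 1/γ)[ln(p choose k) − ln π(k) + (k/2)ln(1 + γ)], which reproduces the linear penalty for the binomial prior and the 2k ln(p/k)-type penalty for the geometric prior quoted above; the function name and parameter values are illustrative:

```python
from math import comb, log

def map_penalty(k, p, gamma, log_prior_k, sigma2=1.0):
    """Complexity penalty induced by a prior pi(k) on the model size
    (assumed form; reproduces the binomial and geometric special cases)."""
    return 2 * sigma2 * (1 + 1 / gamma) * (
        log(comb(p, k)) - log_prior_k + 0.5 * k * log(1 + gamma)
    )

p, gamma, q = 100, 10.0, 0.5
# truncated geometric prior pi(k) ~ q^k; the normalizing constant would only
# shift the penalty by an additive constant and is ignored here
geo = [map_penalty(k, p, gamma, k * log(q)) for k in range(1, p + 1)]
# per-coordinate penalty geo[k-1]/k ~ 2(1 + 1/gamma)(ln(p/k) + c(gamma, q)):
# large (RIC-like) for k << p, bounded (AIC-like) for k ~ p
```

The per-coordinate penalty decays from a RIC-like ln p level at k = 1 to a bounded AIC-like level at k = p, which is exactly the adaptivity described in the text.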

Oracle inequality
In this section we derive an upper bound for the quadratic risk of the proposed MAP model selector and compare it with the ideal minimal quadratic risk, often referred to in the literature as the oracle risk.
Assumption (P). Assume that π(k) ≤ (p choose k) e^{−c(γ)k}, k = 1, ..., r − 1, for some constant c(γ) > 0 depending only on γ.

Assumption (P) is not restrictive. Indeed, the obvious inequality (p choose k) ≥ (p/k)^k implies that, for k < r, it automatically holds for any prior π(·) for all k ≤ pe^{−c(γ)}. Assumption (P) is used to establish an upper bound for the quadratic risk of the MAP model selector.
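The counting bound behind this observation, (p choose k) ≥ (p/k)^k, is elementary and easy to confirm numerically (the value of p below is arbitrary):

```python
from math import comb

p = 200
for k in range(1, p + 1):
    # binomial coefficient dominates (p/k)^k for every 1 <= k <= p
    assert comb(p, k) >= (p / k) ** k
```

Since π(k) ≤ 1 always, this shows that (p choose k)e^{−ck} ≥ (p/k)^k e^{−ck} ≥ 1 ≥ π(k) whenever k ≤ pe^{−c}, regardless of the prior.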
Theorem 1. Let the model M̂ be the solution of (2) with the complexity penalty Pen(·) given in (6)-(7), and let β̂_M̂ be the corresponding least squares estimate. Then, under Assumption (P),

    E||Xβ̂_M̂ − Xβ||² ≤ c₀(γ) inf_M { E||Xβ̂_M − Xβ||² + Pen(|M|) } + c₁(γ)σ²    (8)

for some c₀(γ) and c₁(γ) depending only on γ.
To assess the quality of the upper bound in (8), we compare it with the oracle risk inf_M E||Xβ̂_M − Xβ||². Note that the oracle risk is exactly zero when β ≡ 0 and, evidently, no estimator can achieve it in this case. Hence, an additional, typically negligible term σ², which is essentially the error of estimating a single extra parameter, is usually added to the oracle risk for a proper comparison. It is known that, in general, no estimator can attain a risk within less than a 2 ln p factor of that of the oracle (e.g., Foster & George, 1994; Donoho & Johnstone, 1995; Candès, 2006). The following theorem shows that under certain additional conditions on the prior π(·), the resulting MAP model selector achieves this minimal possible risk among all estimators up to a constant multiplier depending on γ:

Theorem 2 (oracle inequality). Let π(k) satisfy Assumption (P) and, in addition, π(0) ≥ p^{−c} and π(k) ≥ p^{−ck}, k = 1, ..., r, for some constant c > 0.
Then the resulting MAP model selector satisfies

    E||Xβ̂_M̂ − Xβ||² ≤ C(γ) ln p ( inf_M E||Xβ̂_M − Xβ||² + σ² ).

In particular, it can easily be shown that Theorem 2 holds for the (truncated) binomial prior B(p, ξ) with ξ = 1/p (the RIC criterion) and for geometric priors (see Section 2). More generally, all priors such that ln π(k) = O(k ln(k/p)), corresponding to the 2k ln(p/k)-type penalties, satisfy the conditions of Theorem 2.
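The spirit of the oracle comparison can be illustrated by a toy Monte Carlo experiment: for a fixed design, the risk of a penalized least squares selector stays within a modest multiple of the oracle risk. All ingredients below (dimensions, coefficients, and a RIC-type linear penalty standing in for the MAP penalty (6)) are illustrative assumptions:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 6, 1.0
X = rng.standard_normal((n, p))
beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
mu = X @ beta
subsets = [M for k in range(p + 1) for M in itertools.combinations(range(p), k)]

def proj(M):
    """Projection matrix onto the span of the columns in M."""
    return X[:, M] @ np.linalg.pinv(X[:, M]) if M else np.zeros((n, n))

P = {M: proj(M) for M in subsets}
# exact risk of the least squares fit on a *fixed* model M:
# squared bias ||(I - P_M) X beta||^2 plus variance sigma^2 |M|
oracle = min(np.sum((mu - P[M] @ mu) ** 2) + sigma**2 * len(M) for M in subsets)

pen = lambda k: 2 * sigma**2 * np.log(p) * k   # RIC-type linear penalty
risks = []
for _ in range(200):
    y = mu + sigma * rng.standard_normal(n)
    M = min(subsets, key=lambda M: np.sum((y - P[M] @ y) ** 2) + pen(len(M)))
    risks.append(np.sum((P[M] @ y - mu) ** 2))
mc_risk = np.mean(risks)   # Monte Carlo risk of the data-driven selector
```

Here the oracle risk equals 2σ² (the true two-variable model has zero bias), and the selector's Monte Carlo risk stays within a small multiple of oracle + σ², in line with the inequality above.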

Risk bounds for sparse settings
In the previous section we considered the global behavior of the MAP estimator without any restrictions on the model size. However, in the analysis of large data sets it is typically reasonable to assume that the true model in (1) is sparse, that is, that its size is at most p₀; let M_{p0} denote the corresponding set of coefficient vectors with at most p₀ nonzero entries.

Theorem 3. Let 1 ≤ p₀ ≤ r and let π(·) satisfy Assumption (P) together with an additional condition on the prior. Then the quadratic risk of the MAP model selector over M_{p0} satisfies the upper bound (9), while the corresponding minimax lower bound (10) holds, where the infimum is taken over all estimates ŷ of the mean vector Xβ.

Nearly-orthogonal design
In this section we consider the asymptotic properties of the MAP model selector as the sample size n increases. We allow p = p_n to increase with n as well and look for a projection of the unknown mean vector on an expanding span of predictors. In particular, the most challenging cases, intensively studied in the current literature, are those where p > n or even p ≫ n. In such asymptotic settings one should essentially consider a sequence of design matrices X_{n,p_n}, where r_n → ∞. For simplicity of exposition, in what follows we omit the index n, denote X_{n,p_n} by X_p to emphasize the dependence on the number of predictors p, and let r tend to infinity. Similarly, we consider sequences of coefficient vectors β_p and priors π_p(·). In these notations, the original model (1) is transformed into a sequence of models

    y = X_p β_p + ε,    (11)

where rank(X_p) = r, any r columns of X_p are linearly independent (hence, τ_p[r] > 0), ε ∼ N(0, σ²I_n) and the noise variance σ² does not depend on n and p.

The nearly-orthogonality condition essentially means that there is no multicollinearity in the design, in the sense that there are no "too strong" linear relationships within any set of r columns of X_p. Intuitively, it is clear that in this case p cannot be "too large" relative to r and, therefore, to n. Indeed, applying the upper and lower bounds (9) and (10) with p₀ = r/2 yields (C₂/2)σ²τ[r]r(ln(2p/r) + 1) ≤ C₁(γ)σ²r, which implies the following remark:

Remark 1. For a nearly-orthogonal design, necessarily p = O(r) and, therefore, p = O(n).

The following corollary is an immediate consequence of Theorems 3 and 4:

Corollary 1. Let the design be nearly-orthogonal.
1. As r increases, the asymptotic minimax risk of estimating the mean vector X_pβ_p over M_{p0} is of the order min(p₀(ln(p/p₀) + 1), r); that is, there exist two constants 0 < C₁ ≤ C₂ < ∞ such that for all sufficiently large r,

    C₁σ² min(p₀(ln(p/p₀) + 1), r) ≤ inf_ŷ sup_{β_p ∈ M_{p0}} E||ŷ − X_pβ_p||² ≤ C₂σ² min(p₀(ln(p/p₀) + 1), r)

for all 1 ≤ p₀ ≤ r.
One can easily verify that the conditions on the prior of Corollary 1 are satisfied, for example, for the truncated geometric prior (see Section 2) for all k = 1, ..., r. The resulting MAP model selector therefore attains the minimax rates simultaneously for all M_{p0}, p₀ = 1, ..., r. As we have mentioned, the corresponding penalty in (6) is of the 2k ln(p/k)-type. On the other hand, no truncated binomial prior B(p, ξ_p) can satisfy these conditions over the entire range k = 1, ..., r: it is easy to verify that they hold for small ξ_p if k ≪ r but for large ξ_p if k ∼ r. In fact, these arguments are in line with similar results of Foster & George (1994) and Birgé & Massart (2001). Recall that the binomial prior corresponds to linear penalties of the type Pen(k) = 2σ²λk in (2) (see Section 2). Foster & George (1994) and Birgé & Massart (2001, Section 5.2) showed that the best possible risk of such estimators over M_{p0} is only of the order σ²p₀ ln p, achieved for λ ∼ ln p, corresponding to the RIC criterion. This is of the same order as the optimal risk σ²p₀(ln(p/p₀) + 1) for p₀ ≪ p (sparse case) but larger in the dense case (p₀ ∼ p). On the other hand, the risk of the AIC estimator (λ = 1) is of the order σ²r, which is optimal in the dense case but much larger in the sparse case. Finally, note that for a nearly-orthogonal design, ||X_pβ̂_{pM} − X_pβ_p|| ≍ ||β̂_{pM} − β_p||, where "≍" means that the ratio of the two sides is bounded from below and above. Therefore, all the results of Corollary 1 for estimating the mean vector X_pβ_p in (11) apply straightforwardly to estimating the regression coefficients β_p. This equivalence, however, does not hold for the multicollinear design considered below.
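The rate comparison in this discussion reduces to simple arithmetic, sketched below with illustrative values of p and r (here p = r, consistent with the nearly-orthogonal regime p = O(r)):

```python
from math import log

p = r = 1000   # nearly-orthogonal regime: p of the same order as r

def optimal(p0):   # minimax rate min(p0 (ln(p/p0) + 1), r), up to sigma^2
    return min(p0 * (log(p / p0) + 1), r)

def ric(p0):       # best achievable with a linear penalty, lambda ~ ln p
    return p0 * log(p)

def aic(p0):       # AIC risk order (lambda = 1)
    return r

sparse, dense = 5, 900
# sparse case: RIC matches the optimal rate up to a constant, AIC is far larger;
# dense case: AIC matches the optimal rate, RIC overshoots by a log factor
```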

Multicollinear design
The nearly-orthogonality assumption may be reasonable in the "classical" setup, where p is not too large relative to n, but might be questionable for the analysis of high-dimensional data, where p ≫ n, due to the multicollinearity phenomenon (see also Remark 1). When this assumption does not hold, the sparse eigenvalue ratios in (10) may tend to zero as p increases and thus decrease the minimax lower bound rate relative to the nearly-orthogonal design. In this case there is a gap between the rates in the lower and upper bounds (10) and (9). Intuitively, one can think of exploiting correlations between predictors to reduce the size of a model (hence, to decrease the variance) without paying much extra price in the bias term and, therefore, to reduce the risk. We show that under certain additional assumptions on the design and the coefficient vector in (11), the upper risk bound (9) can indeed be reduced to the minimax lower bound rate (10).
For simplicity of exposition we consider the sparse case p₀ ≤ r/2, although the corresponding conditions for the dense case r/2 ≤ p₀ ≤ r can be obtained in a similar way with the necessary changes.
We now introduce several definitions that will be used in the sequel, together with assumptions on the design and on the vector of coefficients β_p in (11), formulated in terms of some positive constants c̃₁, c̃₂, c̃₃ and c̃₄.
Note that by simple algebra one can verify that φ_{p,min}[2k] · φ_p[k] ≤ 1 and, therefore, the constant c̃₃ in Assumption (D.3) is not larger than one.
We have argued that multicollinearity typically arises when p ≫ n. One can easily verify that for n = O(p^α), 0 ≤ α < 1, Assumption (D.2) always follows from Assumption (D.1) and, therefore, can be omitted in this case.
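The sparse eigenvalues appearing in these assumptions can be probed by brute force on toy designs. The sketch below uses a minimal proxy (the smallest eigenvalue of X_M^T X_M over all size-k submodels; the normalization and dimensions are our own illustrative choices) and shows how near-duplicate columns drive it toward zero:

```python
import itertools
import numpy as np

def sparse_min_eig(X, k):
    """Smallest eigenvalue of X_M' X_M over all submodels M of size k (brute force)."""
    n, p = X.shape
    return min(
        np.linalg.eigvalsh(X[:, M].T @ X[:, M])[0]   # eigvalsh: ascending order
        for M in itertools.combinations(range(p), k)
    )

rng = np.random.default_rng(3)
n = 50
Z = rng.standard_normal((n, 4))
X_ortho = Z / np.linalg.norm(Z, axis=0)       # well-conditioned: nearly orthogonal columns
# append noisy copies of the columns: a strongly multicollinear design
X_multi = np.hstack([X_ortho, X_ortho + 0.05 * rng.standard_normal((n, 4))])
X_multi /= np.linalg.norm(X_multi, axis=0)
```

For the nearly-orthogonal design the sparse minimal eigenvalue stays bounded away from zero, while the near-duplicated columns collapse it, which is precisely the regime where the lower bound rate in (10) shrinks.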

Concluding remarks
In this paper we considered a Bayesian approach to model selection in Gaussian linear regression. From a frequentist view, the resulting MAP model selector is a penalized least squares estimator with a complexity penalty associated with a prior π(·) on the model size. Although the proposed estimator originated within the Bayesian framework, the latter was used merely as a natural tool to obtain a wide class of penalized least squares estimators with various complexity penalties. Thus, we believe that the main take-away messages of the paper, summarized below, are of more general interest.
The first main take-away message is that neither linear complexity penalties (e.g., AIC, BIC and RIC), corresponding to binomial priors π(·), nor the closely related Lasso and Dantzig selector estimators can be simultaneously minimax for both sparse and dense cases. We specify the class of priors and associated nonlinear penalties that do yield such a wide adaptivity range; in particular, it includes the 2k ln(p/k)-type penalties.
Another important take-away message concerns the effect of multicollinearity of the design. Unlike model identification or coefficient estimation, for which multicollinearity is a "curse", it may become a "blessing" for estimating the mean vector, allowing one to exploit correlations between predictors to reduce the size of a model (hence, to decrease the variance) without paying much extra price in the bias term. Interestingly, a similar phenomenon occurs in a testing setup (e.g., Hall & Jin, 2010).