Model selection in regression under structural constraints

The paper considers model selection in regression under additional structural constraints on admissible models, where the number of potential predictors might be even larger than the available sample size. We develop a Bayesian formalism as a natural tool for generating a wide class of model selection criteria based on penalized least squares estimation with various complexity penalties associated with a prior on the model size. The resulting criteria are adaptive to the structural constraints. We establish an upper bound for the quadratic risk of the resulting MAP estimator and the corresponding lower bound for the minimax risk over a set of admissible models of a given size. We then specify the class of priors (and, therefore, the class of complexity penalties) for which, under the "nearly-orthogonal" design, the MAP estimator is asymptotically at least nearly-minimax (up to a log-factor) simultaneously over the entire range of sparse and dense setups. Moreover, when the numbers of admissible models are "small" (e.g., ordered variable selection) or, on the contrary, for the case of complete variable selection, the proposed estimator achieves the exact minimax rates.


Introduction
Consider the standard Gaussian linear regression model

y = Xβ + ε,  (1)

where y ∈ R^n is a vector of n observations of the response variable Y, X_{n×p} is the design matrix of the p explanatory variables (predictors) X_1, ..., X_p, β ∈ R^p is a vector of unknown regression coefficients, ε ∼ N(0, σ²I_n), and the noise variance σ² is assumed to be known.
A variety of statistical applications of regression models in different fields nowadays involves a vast number of potential predictors. Moreover, p might even be large relative to the amount of available data n (the p ≫ n setup), which raises a severe "curse of dimensionality" problem. However, typically only some of the predictors have a truly relevant impact on the response y. Model (variable) selection, identifying the "best" sparse subset of these "significant" predictors, therefore becomes crucial in the analysis of such large data sets. For a selected model (a subset of predictors) M, the corresponding coefficients β_M are then typically estimated by least squares.
The goodness of model selection depends on the particular goal at hand. One should distinguish, for example, between estimation of the regression coefficients β, estimation of the mean vector Xβ, model identification and prediction of future observations. Different goals may lead to different optimal model selection procedures, especially when the number of potential predictors p might be much larger than the sample size n. In this paper we consider mainly estimation of the mean vector Xβ, and the goodness of a model M is measured by the quadratic risk E||Xβ̂_M − Xβ||², where β̂_M is the least squares estimate of β for M. The "best" model is then the one with the minimal quadratic risk. Note that the true underlying model in (1) is not necessarily the best in this sense, since sometimes it is possible to reduce its risk by excluding predictors with small (but still nonzero!) coefficients.
The minimum quadratic risk criterion for model selection is evidently impossible to implement since it involves the unknown true β, but the corresponding ideal minimal (oracle) risk can be used as a benchmark for any available model selection procedure. Typical model selection criteria are based on minimizing the empirical quadratic risk ||y − Xβ̂_M||² (the least squares), penalized by a complexity penalty Pen(|M|) increasing with the model size |M|:

M̂ = arg min_M { ||y − Xβ̂_M||² + Pen(|M|) }.  (2)

The properties of the resulting penalized least squares estimator obviously depend on the particular choice of the complexity penalty Pen(·) in (2). The most commonly used choice is a linear penalty of the form Pen(k) = 2σ²λk for some λ > 0. The best-known examples include C_p (Mallows, 1973) and AIC (Akaike, 1973) for λ = 1, BIC (Schwarz, 1978) for λ = (ln n)/2 and RIC (Foster & George, 1994) for λ = ln p. A series of recent works proposed the so-called 2k ln(p/k)-type nonlinear complexity penalties of the form Pen(k) = 2σ²λk(ln(p/k) + 1), assuming the design is of full rank, i.e. rank(X) = min(p, n). They also showed that for 2k ln(p/k)-type penalties, the resulting penalized estimators (2) are simultaneously asymptotically optimal (in the minimax sense) for the entire range of sparse and dense models, while linear penalties cannot achieve such a wide optimality range.
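To make the penalized least squares criterion (2) concrete, here is a minimal brute-force sketch for complete variable selection with a linear penalty Pen(k) = 2σ²λk. The function name `select_model` and the simulated data are illustrative only, and exhaustive search over all subsets is feasible only for small p.

```python
# Sketch of the penalized least squares criterion (2) for complete variable
# selection with a linear penalty Pen(k) = 2*sigma2*lam*k (AIC: lam = 1,
# RIC: lam = ln p). Brute-force enumeration is feasible only for small p.
import itertools
import numpy as np

def select_model(y, X, sigma2, lam):
    """Return the subset of columns minimizing ||y - X beta_M||^2 + 2*sigma2*lam*|M|."""
    n, p = X.shape
    best, best_crit = (), float(np.sum(y ** 2))  # null model: RSS = ||y||^2, zero penalty
    for k in range(1, p + 1):
        for M in itertools.combinations(range(p), k):
            XM = X[:, M]
            beta, *_ = np.linalg.lstsq(XM, y, rcond=None)
            rss = float(np.sum((y - XM @ beta) ** 2))
            crit = rss + 2 * sigma2 * lam * k
            if crit < best_crit:
                best, best_crit = M, crit
    return best, best_crit

rng = np.random.default_rng(0)
n, p, sigma = 50, 6, 1.0
X = rng.standard_normal((n, p))
beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # true model uses predictors 0 and 1
y = X @ beta + sigma * rng.standard_normal(n)
M_hat, _ = select_model(y, X, sigma ** 2, lam=np.log(p))  # RIC-type penalty
print(M_hat)
```

With a signal this strong, the two truly nonzero predictors are always retained, while the RIC-type penalty discourages spurious ones.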
So far, the above minimax properties of various model selection procedures in (1) have been established without structural constraints on the set of admissible models. In this paper we extend them to model selection under such constraints: we establish an upper bound for the quadratic risk of the resulting MAP estimator and the corresponding lower bound for the minimax quadratic risk over a set of models of a given size under structural constraints.
We then specify the class of priors (and, therefore, the class of the corresponding complexity penalties) for which, under the "nearly-orthogonal" design, the resulting MAP estimators are asymptotically at least nearly-minimax (up to a log-factor) simultaneously over the wide range of sparse and dense models. In particular, when the numbers of admissible models are "small" (e.g., ordered variable selection) and for complete variable selection, they lead to AIC-type and 2k ln(p/k)-type estimators respectively and achieve the exact minimax rates.
The paper is organized as follows. In Section 2 we develop a Bayesian formalism for model selection in regression under structural constraints and derive the MAP estimator. Its theoretical properties are presented in Section 3. In particular, we establish the upper bound for its quadratic risk, the corresponding lower bound for the minimax risk and discuss its asymptotic minimaxity in various setups. Section 4 presents the results of a simulation study. Concluding remarks are summarized in Section 5, while all the proofs are given in the Appendix.

MAP model selection procedure under structural constraints
We first extend the Bayesian formalism for model selection in linear regression developed by Abramovich & Grinshtein (2010) to the presence of structural constraints. We assume that the latter are known and that the set of all admissible models is therefore fixed.
Consider the linear regression model (1), where the number of possible predictors p might be even larger than the number of observations n. Let r = rank(X) (≤ min(p, n)) and assume that any r columns of X are linearly independent. For the "standard" linear regression setup, where all p predictors are linearly independent and there are at least p linearly independent design points, r = p. Let m(p_0) be the number of all admissible models of size p_0. The case p_0 = 0 corresponds to the null model with a single intercept and, therefore, m(0) = 1. In fact, we can consider only p_0 ≤ r since otherwise there necessarily exists another vector β* with at most r nonzero entries such that Xβ = Xβ*. Although for p_0 = r there may be several different admissible models, all of them are evidently indistinguishable for estimating Xβ and can be associated with a single (saturated) model. Thus, without loss of generality, we can always assume that m(r) = 1. Obviously, 1 ≤ m(p_0) ≤ C(p, p_0), where the two extreme cases m(p_0) = 1 and m(p_0) = C(p, p_0) for all p_0 = 0, ..., r − 1 correspond respectively to ordered and complete variable selection.
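As a quick sanity check on the two extreme cases, the counts m(k) can be computed directly; p = 10 here is arbitrary.

```python
# The two extreme structural constraints: ordered variable selection has exactly
# one admissible model per size, m(k) = 1, while complete variable selection has
# m(k) = C(p, k), all subsets of size k.
from math import comb

p = 10
m_ordered = [1 for k in range(p)]            # nested models beta_1, ..., beta_k
m_complete = [comb(p, k) for k in range(p)]  # all subsets of size k

print(m_ordered[3], m_complete[3])  # 1 and C(10, 3) = 120
```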
For m(k) > 0, we assume all m(k) admissible models of a given size k to be equally likely, that is, conditionally on the model size |M| = k,

P(M | |M| = k) = 1/m(k),

where recall that m(r) = 1 and hence P(M | |M| = r) = 1. To complete the prior, for any given model M we assume the normal prior on its unknown coefficients β_M:

β_M | M ∼ N(0, γσ²(X_M' X_M)^{-1})

for some γ > 0, which is the well-known g-prior of Zellner (1986).
For the proposed hierarchical prior, straightforward Bayesian calculus yields the posterior probability of a model M. Finding the most likely model leads therefore to the following maximum a posteriori (MAP) model selection criterion (4), which is of the general type (2) with the complexity penalty

Pen(|M|) = 2σ²(1 + 1/γ) ( ln[ m(|M|)/π(|M|) ] + (|M|/2) ln(1 + γ) ).  (5)

The specific form of the penalty (5) is defined by the choice of the prior π(·) on the model size.

Upper and lower risk bounds

Let M_{p_0} denote the set of all admissible models with at most p_0 predictors. Obviously, if a model M ∈ M_{p_0}, the corresponding coefficient vector β_M ∈ R^p has at most p_0 nonzero entries. Following our arguments from Section 2, the set Ω of all admissible models can be essentially reduced to M_r.
In this section we derive the upper and lower bounds for the maximal risk of the proposed MAP model selector (4) over M p 0 .
Theorem 1 (upper bound). Let the prior π(·) satisfy condition (6). Then there exists a constant C(γ) > 0, depending only on γ, such that the upper bound (7) on the quadratic risk of the MAP estimator (4) holds simultaneously for all 1 ≤ p_0 ≤ r.
To assess the accuracy of the established upper bound for the quadratic risk of the proposed MAP estimator, we derive the lower bound for the minimax risk of estimating Xβ in (1).
The l_0 quasi-norm ||β||_0 of a vector β is defined as the number of its nonzero entries. For any given k = 1, ..., r, let φ_min[k] and φ_max[k] be the k-sparse minimal and maximal eigenvalues of the design, defined as

φ_min[k] = min_{β: ||β||_0 = k} ||Xβ||²/||β||²,   φ_max[k] = max_{β: ||β||_0 = k} ||Xβ||²/||β||².

In fact, φ_min[k] and φ_max[k] are respectively the minimal and maximal eigenvalues over all k × k submatrices of the matrix X'X generated by any k columns of X.
The k-sparse eigenvalues ratio τ[k] = φ_min[k]/φ_max[k] is a non-increasing function of k. Obviously, τ[k] ≤ 1, and for the orthogonal design the equality holds for all k.
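The sparse eigenvalues and their ratio can be computed by brute force for a small design. This sketch (with illustrative names, feasible only for small p, since all C(p, k) submatrices are enumerated) also checks numerically that τ[k] ≤ 1 and that τ[k] is non-increasing in k.

```python
# Brute-force computation of the k-sparse minimal/maximal eigenvalues
# phi_min[k], phi_max[k] (extreme eigenvalues over all k x k submatrices of
# X'X) and the ratio tau[k] = phi_min[k]/phi_max[k].
import itertools
import numpy as np

def sparse_eigenvalues(X, k):
    p = X.shape[1]
    phi_min, phi_max = np.inf, 0.0
    for cols in itertools.combinations(range(p), k):
        G = X[:, cols].T @ X[:, cols]   # k x k submatrix of X'X
        w = np.linalg.eigvalsh(G)       # ascending eigenvalues
        phi_min, phi_max = min(phi_min, w[0]), max(phi_max, w[-1])
    return phi_min, phi_max

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6))        # n = 30 > p = 6, full rank a.s.
tau = []
for k in range(1, 7):
    lo, hi = sparse_eigenvalues(X, k)
    tau.append(lo / hi)

# tau[k] <= 1 and tau[k] is non-increasing in k (Cauchy interlacing)
assert all(t <= 1 + 1e-9 for t in tau)
assert all(tau[i] >= tau[i + 1] - 1e-9 for i in range(len(tau) - 1))
print([round(t, 3) for t in tau])
```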
Theorem 2 (minimax lower bound). Consider the model (1) and let 1 ≤ p_0 ≤ r. There exists a universal constant C > 0 such that the minimax risk satisfies the lower bound (9), where the infimum is taken over all estimates ỹ of the mean vector Xβ.
In some particular cases, e.g. for complete variable selection, the general minimax lower bounds established in Theorem 2 can be improved by removing the max(1, ln p_0)-term in (9); for ordered variable selection the main term in both (7) and (9) is anyway p_0.
The upper bound (7) holds for any design matrix X of rank r, while the minimax lower bound (9) depends on X, but only through the sparse eigenvalues ratios. Finally, note that the structural constraints are manifested in the upper and lower bounds only through m(p_0), the number of admissible models of size p_0.

Asymptotic adaptive minimaxity of the MAP estimator
The upper and lower risk bounds (7) and (9) established in Section 3.1 are the key for investigating the asymptotic minimaxity of the proposed MAP estimator, where the number of possible predictors p = p_n may increase with the sample size n. One can view such a setup as a series of projections of the vector Xβ on the expanding span of predictors. In particular, we may have p_n > n or even p_n ≫ n. Thus, formally, we now consider a sequence of design matrices X_{p,n}, where r_n = rank(X_{p,n}) → ∞. For simplicity of exposition, hereafter we omit the index n. Similarly, there are sequences of coefficient vectors β_p and priors π_p(·). In these notations the original model (1) is transformed into a sequence of models

y = X_p β_p + ε,

where rank(X_p) = r and any r columns of X_p are linearly independent (hence, τ_p[r] > 0), ε ∼ N(0, σ²I_n) and the noise variance σ² does not depend on n and p.
We consider the nearly-orthogonal design, where the sequence of sparse eigenvalues ratios τ_p[r] is bounded away from zero. Nearly-orthogonality means that there are no "too strong" linear relationships within any set of r columns of the design matrix X_p. Evidently, in this case p cannot be "too large" relative to r and, therefore, to n.

Corollary 1 (bounds for the minimax risk). Let the design be nearly-orthogonal. Then there exist two constants 0 < C_1 ≤ C_2 < ∞ such that the corresponding bounds on the minimax risk hold for all sufficiently large r.

Corollary 2 (asymptotic adaptive minimaxity of the MAP estimator). Consider the nearly-orthogonal design and assume that for m(k) > 0 the prior π(k) satisfies

min{ m(k)^{−c}, e^{−ck} } ≤ π(k) ≤ m(k) e^{−c(γ)k},  k = 1, ..., r,

for some c > 0 and c(γ) defined in Theorem 1. Then the corresponding MAP estimator (4) is asymptotically at least nearly-minimax (up to a ln p_0-factor) simultaneously over all M_{p_0}, 1 ≤ p_0 ≤ r.
The above general results depend on the asymptotic behavior of m(p 0 ) as a function of p 0 .
The resulting MAP estimator (4) then attains the minimax rates simultaneously over all M_{p_0}, 1 ≤ p_0 ≤ r. Note that σ²p_0 is the risk of the (unbiased) least squares estimator of Xβ_{M_0} for a true model M_0 of size p_0 in (1). Corollary 3 therefore verifies that when the number of admissible models is small, there is essentially no extra price for model selection.
It follows from (5) that priors satisfying the conditions of Corollary 3 lead to AIC-type penalties of the form Pen(k) ∼ 2C(γ)σ²k for some C(γ) > 1. The conditions on the prior of Corollary 4 hold, for example, for the truncated geometric prior π(k) ∝ q^k, k = 1, ..., r for some 0 < q < 1, and the corresponding penalties in (5) are of the 2k ln(p/k)-type, with Pen(k) = 2C(γ)σ²k(ln(p/k) + 1)(1 + o(1)) for some C(γ) > 1.

Consider now hierarchical model selection with K main effects and their paired interactions, where an interaction may enter the model only together with both of its main effects; thus p = K + K(K − 1)/2. Obviously, m(k) < C(p, k), k = 1, ..., r − 1. On the other hand, this is an example with "large" numbers of admissible models, where ln m(k) ≥ ck ln(p/k) for some 0 < c < 1. Indeed, for models of sizes one and two only main effects can be selected, and the numbers of admissible models are m(1) = K and m(2) = C(K, 2) respectively. One can trivially verify that in both cases ln m(k) ≥ ck ln(p/k), k = 1, 2 for some positive constant c < 1. For k ≥ 3, we have the following lemma:

Lemma 1. For hierarchical model selection with K main effects and their paired interactions, ln m(k) ≥ (k/3) ln(p/k) for all 3 ≤ k ≤ r − 1.

By Corollary 2, the resulting MAP estimator is asymptotically at least nearly-minimax (up to a ln p_0-factor) over all M_{p_0}, 1 ≤ p_0 ≤ r, and we can only conjecture that, similarly to the complete variable selection, this log-factor can be removed in this case as well.
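For the hierarchical example above, the numbers of admissible models can be verified by brute force for a tiny K; the helper `count_admissible` is illustrative. An interaction is admissible only together with both of its main effects, which reproduces m(1) = K and m(2) = C(K, 2).

```python
# Brute-force count of admissible models of size k for hierarchical selection
# with K main effects and their paired interactions (an interaction may enter
# only together with both of its main effects). Feasible only for tiny K.
import itertools

def count_admissible(K, k):
    mains = list(range(K))
    inters = list(itertools.combinations(mains, 2))
    predictors = [('m', i) for i in mains] + [('i', ij) for ij in inters]
    count = 0
    for model in itertools.combinations(predictors, k):
        chosen_mains = {t[1] for t in model if t[0] == 'm'}
        # admissible iff every chosen interaction has both its mains chosen
        ok = all(set(t[1]) <= chosen_mains for t in model if t[0] == 'i')
        count += ok
    return count

K = 4
assert count_admissible(K, 1) == K               # only single main effects
assert count_admissible(K, 2) == K * (K - 1) // 2  # only pairs of main effects
print([count_admissible(K, k) for k in range(1, 5)])
```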
Similarly to complete variable selection, when the numbers of admissible models for the intermediate case are "large" (e.g., the hierarchical model selection with main effects and paired interactions considered above), the conditions on the prior in Corollary 2 are satisfied for the truncated geometric prior π(k) ∝ q^k, k = 1, ..., r for some 0 < q < 1, corresponding to complexity penalties of the 2k ln(p/k)-type.
Note also that for the nearly-orthogonal design, ||X_p β̂_{p,M} − X_p β_p|| ≍ ||β̂_{p,M} − β_p||, and all the results of this section for estimating the mean vector X_p β_p can therefore be straightforwardly applied to estimating the regression coefficients β_p.

Simulation study
We conducted a short simulation study to demonstrate the performance of the proposed MAP model selection procedure. We considered polynomial regression, which is an example of ordered variable selection (see Section 1):

y_i = Σ_{j=0}^{k} β_j x_i^j + ε_i,  i = 1, ..., n,

where 0 ≤ k ≤ n − 1 is the polynomial degree to be selected, and the ε_i ∼ N(0, σ²) are independent with known variance σ². In this case, obviously, m(k) = 1 for all k = 0, ..., n − 1. An example of a prior satisfying the conditions of Corollary 3 is the (truncated) geometric prior Geom(1 − q), where π(k) = (1 − q)q^k/(1 − q^n) ∝ q^k, k = 0, ..., n − 1 for some 0 < q < 1. The corresponding complexity penalty (5) is the AIC-type linear penalty

Pen(k) = 2σ²(1 + 1/γ) ln( q^{−1}√(1 + γ) ) k.  (12)
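A minimal sketch of the resulting procedure for polynomial regression: nested least squares fits penalized by the linear penalty (12). Here γ and q are treated as known (in practice they are estimated from the data), and the data-generating function and seeds are illustrative.

```python
# MAP degree selection for polynomial regression: among nested polynomial
# models of degree k, minimize RSS + Pen(k) with the AIC-type penalty (12),
#   Pen(k) = 2*sigma2*(1 + 1/gamma)*ln(sqrt(1 + gamma)/q) * k.
import numpy as np

def map_poly_degree(x, y, sigma2, gamma, q, max_deg):
    lam = (1 + 1 / gamma) * np.log(np.sqrt(1 + gamma) / q)
    best_k, best_crit = 0, np.inf
    for k in range(max_deg + 1):
        V = np.vander(x, k + 1, increasing=True)  # columns 1, x, ..., x^k
        beta, *_ = np.linalg.lstsq(V, y, rcond=None)
        rss = float(np.sum((y - V @ beta) ** 2))
        crit = rss + 2 * sigma2 * lam * k
        if crit < best_crit:
            best_k, best_crit = k, crit
    return best_k

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 100)
y = 1 + 2 * x - 3 * x ** 3 + 0.1 * rng.standard_normal(100)  # true degree 3
k_hat = map_poly_degree(x, y, sigma2=0.01, gamma=10.0, q=0.5, max_deg=10)
print(k_hat)
```

With this signal-to-noise level the cubic term is always detected, so the selected degree is at least 3.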

Estimation of parameters
To apply the developed MAP model selection procedure we need to specify the prior parameters γ and q in (12). They are rarely known a priori in practice and usually should be estimated from the data.
Let X_{n×n} be the design matrix of the saturated model (of degree n − 1). Straightforward calculus then yields the marginal likelihood of the observed data y in terms of the matrices H_k = XD_k(D_k X'XD_k)^+ D_k X', where D_k is the diagonal indicator matrix selecting the first k + 1 columns of X. The MLEs for γ and q can be obtained numerically by the EM algorithm: regard k as "missing" data and define the corresponding latent indicator variables u_j = δ_{jk}, j = 0, ..., n − 1.
The complete log-likelihood for the "augmented" data (y, u) is then, up to an additive constant, a weighted sum over the indicators u_j. On the E-step at the h-th iteration we compute the conditional expectation l̃^[h] of the complete log-likelihood, where the weights û_j^[h] are the current posterior probabilities of the corresponding degrees. At the M-step we maximize l̃^[h] w.r.t. γ and q to obtain the updated estimates γ̂^[h+1] and q̂^[h+1]. There is no closed form solution for q̂^[h+1]. However, replacing the truncated geometric distribution by the usual one, and thus ignoring the last term ln(1 − q^n) in the RHS of (13), yields a good approximation of q̂^[h+1] for large n.
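Under the untruncated-geometric approximation just mentioned (dropping the ln(1 − q^n) term), the M-step update for q has a simple closed form: maximizing Σ_k û_k (ln(1 − q) + k ln q) in q gives q = k̄/(1 + k̄), where k̄ = Σ_k k û_k is the posterior expected degree from the E-step. A sketch with toy E-step weights (the weights themselves are illustrative, not computed from data):

```python
# Approximate M-step update for q in the EM algorithm: with the untruncated
# geometric prior pi(k) = (1 - q) q^k, the maximizer of the expected complete
# log-likelihood  sum_k u_k * (ln(1 - q) + k ln q)  is q = kbar / (1 + kbar).
import numpy as np

def q_update(u):
    """u[k] = posterior probability of degree k from the E-step."""
    kbar = float(np.sum(np.arange(len(u)) * u))
    return kbar / (1.0 + kbar)

u = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # toy posterior over degrees 0..4
print(round(q_update(u), 3))  # kbar = 2.0, so q = 2/3 ~ 0.667
```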

The results
We used two test functions: a fifth-degree polynomial g_1 and a second test function g_2 requiring a high-order polynomial approximation; the noise terms were N(0, σ²) and independent. The noise variance σ² was chosen to ensure a signal-to-noise ratio SNR at levels 3, 5 and 7 and was assumed to be known. The number of replications was 100. The proposed MAP model selector results in a model selection procedure with a linear penalty of the form Pen(k) = 2σ²λk with λ_MAP = (1 + 1/γ) ln(q^{−1}√(1 + γ)) (see (12)). We compared it with two other well-known model selection procedures with linear penalties, namely AIC (λ_AIC = 1) and RIC (λ_RIC = ln p) (see Section 1). In our case, λ_RIC = ln 100 ≈ 4.6. Table 1 summarizes the mean squared errors averaged over 100 replications (AMSE). We also present the average polynomial degrees selected by the three methods for approximating the true response functions. As expected, the more conservative RIC tends to include fewer terms in the model and outperforms AIC for the polynomial g_1, while the latter is superior for g_2, where a high-order polynomial approximation is required. The MAP estimator with estimated γ and q yields a data-driven λ and is adaptive to the unknown polynomial degree: it behaves very similarly to RIC when the degree is low and to AIC when it is high.
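For intuition on how the estimated (γ, q) drive this adaptivity, one can compare the resulting slope λ_MAP = (1 + 1/γ) ln(q^{−1}√(1 + γ)) with λ_AIC = 1 and λ_RIC = ln p. The (γ, q) values below are illustrative, not estimates from the study.

```python
# Comparison of linear-penalty slopes in the study's setting (p = n = 100):
# lambda_AIC = 1, lambda_RIC = ln p, and the data-driven MAP slope
# lambda = (1 + 1/gamma) * ln(sqrt(1 + gamma)/q) for illustrative (gamma, q).
import numpy as np

def lam_map(gamma, q):
    return (1 + 1 / gamma) * np.log(np.sqrt(1 + gamma) / q)

print(round(np.log(100), 2))        # lambda_RIC = ln 100 ~ 4.61
print(round(lam_map(5.0, 0.9), 2))  # q close to 1: mild, AIC-like slope
print(round(lam_map(5.0, 0.2), 2))  # small q: conservative, RIC-like slope
```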

Concluding remarks
In this paper we considered model selection in linear regression under general structural constraints and extended the existing results for the complete variable selection. In particular, we utilized a Bayesian MAP model selection procedure of Abramovich & Grinshtein (2010) and modified it correspondingly. From a frequentist view, the resulting MAP model selector is a penalized least squares estimator with a complexity penalty associated with a prior on the model size which is adaptive to the structural constraints. In fact, the proposed Bayesian approach can be used as a tool for generating a wide class of penalized least squares estimators with various complexity penalties.
We established the general upper bound for the quadratic risk of the MAP estimator over a set of admissible models of a given size and the lower bound for the corresponding minimax risk. Based on these results, we showed that for the nearly-orthogonal design, the MAP estimator is asymptotically at least nearly minimax (up to a log-factor) for the entire range of sparse and dense models.
Moreover, when the numbers of admissible models are "small" or, on the opposite, for the case of complete variable selection, it achieves the exact minimax rates. The corresponding MAP model selection procedures lead respectively to AIC-type and 2k ln(p/k)-type criteria. Whether these results on asymptotic minimaxity are true for the intermediate case remains so far a conjecture.
There are also other challenges for future research. The assumption of nearly-orthogonal design used in investigating asymptotic minimaxity of MAP estimators in Section 3.2 typically does not hold in the p ≫ n setup due to the multicollinearity phenomenon. The analysis of multicollinear design, where the sequence of sparse eigenvalues ratios τ_p[r] may tend to zero as p increases, is much more delicate; in this case there is a gap (in addition to a log-factor) between the rates in the upper and lower bounds (7) and (9). Another challenge is computational: for large p, direct search over all admissible models in (4) is infeasible, and stochastic search procedures generating a sample of models from the posterior distribution can be used instead. The key point is that the relevant models with the highest posterior probabilities will appear most frequently and can be identified even from a generated sample of relatively small size, avoiding computation of the entire posterior distribution. However, most of the above approaches have been developed and studied for complete variable selection. Their adaptation to minimization of (4) subject to the additional structural constraints, while remaining computationally feasible, should depend on the specific type of constraints at hand. In particular, for somewhat different priors, Chipman (1996) considered Bayesian variable selection under related structural constraints among predictors.

Appendix
Throughout the proofs we use C to denote a generic positive constant, not necessarily the same each time it is used, even within a single equation.
The condition (15) follows immediately from the definition of L_k. Consider now (16), which is equivalent to the inequality (17). Repeating the calculus in the proof of Theorem 1 of Abramovich, Grinshtein & Pensky (2007), one verifies that, with the upper bound on the prior π(k) in (6) and C = 1 + 1/(2γ), (17) and therefore (16) are satisfied for all L_k, k = 1, ..., r.
On the other hand, for p_0 < r, (18) and (8) yield (20). Combining (19) and (20) completes the proof.

Consider the corresponding subset of mean vectors G_{p_0}, where card(G_{p_0}) = card(B_{p_0}). For any g_1, g_2 ∈ G_{p_0} and the associated β_{1M}, β_{2M} ∈ B_{p_0}, we then have (28). On the other hand, by similar arguments, the Kullback-Leibler divergence satisfies (29). With an appropriate choice of α, (28) and (29) yield

||g_1 − g_2||² ≥ τ[2p_0] σ² α ln m(p_0)/(8C)   and   K(P_{g_1}, P_{g_2}) ≤ (1/16) ln card(G_{p_0}).

Proof of the lemma. Recall that p = K + K(K − 1)/2 = K(K + 1)/2.

Case 1: 3 ≤ k ≤ (3/2)K. Consider first all models that include any ⌊k/3⌋ paired interactions and the corresponding main effects. Evidently, the size of any such model is at most k. If it is less than k (which happens when the same main effect appears in several interactions), we complete it to size k by adding main effects from the remaining ones (one can easily verify that there are enough remaining main effects since k − ⌊k/3⌋ ≤ K). We then have m(k) ≥ C(K(K − 1)/2, ⌊k/3⌋) and, therefore, by straightforward calculus,

ln m(k) ≥ (k/3) ln( K(K − 1)/(2⌊k/3⌋) ) ≥ (k/3) ln( K(K + 1)/(2k) ) = (k/3) ln(p/k).

Case 2: (3/2)K < k ≤ r − 1. For this case we consider models of size k that include all K main effects and any k − K paired interactions. Thus, m(k) ≥ C(K(K − 1)/2, k − K). To complete the proof, one can easily verify that K(K − 1)/(2(k − K)) ≥ K(K + 1)/(2k) = p/k for all k ≤ r − 1, and hence ln m(k) ≥ (k − K) ln(p/k) ≥ (k/3) ln(p/k), since k − K ≥ k/3 in this case.