On the Post-Selection Inference constant under Restricted Isometry Properties

Uniformly valid confidence intervals post model selection in regression can be constructed based on Post-Selection Inference (PoSI) constants. PoSI constants are minimal for orthogonal design matrices and, for generic design matrices, can be upper bounded as a function of the sparsity of the set of models under consideration. In order to improve on these generic sparse upper bounds, we consider design matrices satisfying a Restricted Isometry Property (RIP) condition. We provide a new upper bound on the PoSI constant in this setting. This upper bound is an explicit function of the RIP constant of the design matrix, thereby providing an interpolation between the orthogonal setting and the generic sparse setting. We show that this upper bound is asymptotically optimal in many settings by constructing a matching lower bound.


Introduction
Fitting a statistical model to data is often preceded by a model selection step. The construction of valid statistical procedures in such post model selection situations is quite challenging (cf. [21,22,23], [17] and [25], and the references given in that literature), and has recently attracted a considerable amount of attention. Among various recent references in this context, let us mention the confidence intervals of [19], which achieve coverage conditionally on the selected model for the lasso, and the uniform bounds over sparse models recently and independently obtained in [18]. In this paper, we study the PoSI constants of [6] for design matrices satisfying a Restricted Isometry Property (RIP). Section 2 reviews the PoSI framework and the known order of magnitude of the PoSI constant. Section 3 presents our upper bound under RIP conditions, its theoretical comparison with [18] and its applicability. In Section 4 we provide the lower bound and the optimality result for the upper bound. All the proofs are given in the appendix.

PoSI confidence intervals
We consider and briefly review the framework introduced by [6], in which the so-called PoSI constant plays a central role. The goal is to construct post-model selection confidence intervals that are agnostic with respect to the model selection method used. The authors of [6] assume a Gaussian vector of observations $Y = \mu + \epsilon$, where the $n \times 1$ mean vector $\mu$ is fixed and unknown, and $\epsilon$ follows the $N(0, \sigma^2 I_n)$ distribution, where $\sigma^2 > 0$ is unknown. Consider an $n \times p$ fixed design matrix $X$, whose columns correspond to explanatory variables for $\mu$. It is not necessarily assumed that $\mu$ belongs to the image of $X$ or that $n \ge p$. A set of models $\mathcal{M} \subset \{M \,|\, M \subset \{1, \ldots, p\}\}$ is supposed to be given. Following [6], for any $M \in \mathcal{M}$, the projection-based vector of regression coefficients $\beta_M$ is a target of inference, with
$$\beta_M := \mathop{\mathrm{Arg\,Min}}_{b \in \mathbb{R}^{|M|}} \|\mu - X_M b\|_2^2, \qquad (2)$$
where $X_M$ is the submatrix of $X$ formed of the columns of $X$ with indices in $M$, and where we assume that for each $M \in \mathcal{M}$, $X_M$ has full rank and $M$ is nonempty. We refer to [6] for an interpretation of the vector $\beta_M$ and a justification for considering it as a target of inference. In [6], the different quantities involved, which we now define, are standard ingredients for univariate confidence intervals for regression coefficients in the Gaussian model, except for the last factor (the "PoSI constant"), which accounts for the multiplicity of covariates and models, and for their simultaneous coverage. The confidence interval is centered at $\hat\beta_M := (X_M^t X_M)^{-1} X_M^t Y$, the ordinary least squares estimator of $\beta_M$; also, if $M = \{j_1, \ldots, j_{|M|}\}$ with $j_1 < \ldots < j_{|M|}$, for $i \in M$ we denote by $i.M$ the number $k \in \mathbb{N}$ for which $j_k = i$, that is, the rank of the $i$-th element in the subset $M$. The quantity $\hat\sigma^2$ is an unbiased estimator of $\sigma^2$; more specifically, it is assumed to be an observable random variable such that $\hat\sigma^2$ is independent of $P_X Y$ and is distributed as $\sigma^2/r$ times a chi-square distributed random variable with $r$ degrees of freedom ($P_X$ denoting the orthogonal projection onto the column space of $X$). We allow for $r = \infty$, corresponding to $\hat\sigma = \sigma$, i.e., the case of known variance (also called the Gaussian limiting case). In [6], it is assumed that $\hat\sigma$ exists and it is shown that this indeed holds in some specific situations. A further analysis of the existence of $\hat\sigma$ is provided in [1,2].
The next quantity to define is
$$v_{M,i} := X_M G_M^{-1} e^{|M|}_{i.M}, \qquad (4)$$
where $e^b_a$ is the $a$-th base column vector of $\mathbb{R}^b$, and $G_M := X_M^t X_M$ is the $|M| \times |M|$ Gram matrix formed from the columns of $X_M$. Observe that $v_{M,i}$ is nothing more than the row corresponding to covariate $i$ in the estimation matrix $G_M^{-1} X_M^t$. The confidence interval is then
$$CI_{i,M} := \big[\hat\beta_{M,i.M} - K\,\hat\sigma\,\|v_{M,i}\|_2,\ \hat\beta_{M,i.M} + K\,\hat\sigma\,\|v_{M,i}\|_2\big], \qquad (3)$$
where the constant $K = K(X, \mathcal{M}, \alpha, r)$ is called a PoSI constant, and we turn to its definition. We shall occasionally write for simplicity $K(X, \mathcal{M}, \alpha, r) = K(X, \mathcal{M})$. Furthermore, if the value of $r$ is not specified in $K(X, \mathcal{M})$, it is implicit that $r = \infty$.

Definition 2.1. Let $\mathcal{M} \subset \mathcal{M}_{all} := \{M \,|\, M \subset \{1, \ldots, p\}\}$ be such that each $M \in \mathcal{M}$ is non-empty and $X_M$ has full rank. Let $\xi$ be a Gaussian vector with zero mean vector and identity covariance matrix on $\mathbb{R}^n$. Let $N$ be a random variable, independent of $\xi$, such that $rN^2$ follows a chi-square distribution with $r$ degrees of freedom; if $r = \infty$, we let $N = 1$. For $\alpha \in (0, 1)$, $K(X, \mathcal{M}, \alpha, r)$ is defined as the $1 - \alpha$ quantile of
$$\max_{M \in \mathcal{M}}\ \max_{i \in M}\ \frac{|v_{M,i}^t\, \xi|}{\|v_{M,i}\|_2\, N}. \qquad (5)$$

We remark that $K(X, \mathcal{M}, \alpha, r)$ is the same as in [6]. For $j = 1, \ldots, p$, let $X_j$ be the column $j$ of $X$. We also remark, from [6], that the vector $v_{M,i}/\|v_{M,i}\|_2$ in (4) is the normalized residual of the regression of $X_i$ on the variables $\{X_j \,|\, j \in M \setminus \{i\}\}$; in other words, it is proportional to the component of the vector $X_i$ orthogonal to $\mathrm{Span}\{X_j \,|\, j \in M \setminus \{i\}\}$. It is shown in [6] that we have, with probability larger than $1 - \alpha$,
$$\text{for all } M \in \mathcal{M} \text{ and all } i \in M: \quad (\beta_M)_{i.M} \in CI_{i,M}. \qquad (6)$$
Hence, the PoSI confidence intervals guarantee a simultaneous coverage of all the projection-based regression coefficients, over all models $M$ in the set $\mathcal{M}$. For a square symmetric non-negative matrix $A$, we let
$$\mathrm{corr}(A) := \big(\mathrm{diag}(A)^{1/2}\big)^{\dagger}\, A\, \big(\mathrm{diag}(A)^{1/2}\big)^{\dagger},$$
where $\mathrm{diag}(A)$ is obtained by setting all the non-diagonal elements of $A$ to zero and where $B^{\dagger}$ is the Moore-Penrose pseudo-inverse of $B$. Then we show in the following lemma that $K(X, \mathcal{M})$ depends on $X$ only through $\mathrm{corr}(X^t X)$.

Lemma 2.2.
Let $X$ and $Z$ be $n \times p$ and $m \times p$ matrices satisfying $\mathrm{corr}(X^t X) = \mathrm{corr}(Z^t Z)$. Then $K(X, \mathcal{M}, \alpha, r) = K(Z, \mathcal{M}, \alpha, r)$.
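To make the preceding definitions concrete, the following minimal R sketch (all function and variable names are ours; $p$ is kept small so that all non-empty models can be enumerated) estimates $K(X, \mathcal{M}_{all}, \alpha, \infty)$ by Monte Carlo, following Definition 2.1, and can be used to check the invariance stated in Lemma 2.2.

posi_constant <- function(X, alpha = 0.05, nsim = 1e4) {
  p <- ncol(X); n <- nrow(X)
  # enumerate all non-empty models M of {1, ..., p}
  models <- unlist(lapply(1:p, function(s) combn(p, s, simplify = FALSE)),
                   recursive = FALSE)
  V <- NULL
  for (M in models) {
    XM <- X[, M, drop = FALSE]
    VM <- XM %*% solve(crossprod(XM))   # columns are the vectors v_{M,i} of (4)
    V <- cbind(V, apply(VM, 2, function(v) v / sqrt(sum(v^2))))
  }
  # Monte Carlo draws of the max in (5), with N = 1 (case r = Inf)
  gam <- replicate(nsim, max(abs(crossprod(V, rnorm(n)))))
  unname(quantile(gam, 1 - alpha))
}
set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)
K1 <- posi_constant(X)
Z <- X %*% diag(c(2, 0.5, 1, 3))   # corr(Z^t Z) = corr(X^t X)
K2 <- posi_constant(Z)             # agrees with K1 up to Monte Carlo error

Rescaling the columns leaves $\mathrm{corr}(X^t X)$ unchanged, so by Lemma 2.2 the two estimates agree up to simulation error.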

Order of magnitude of the PoSI constant
The confidence intervals in (3) are similar in form to the standard confidence intervals that one would use for a single fixed model $M$ and a fixed $i \in M$. For a standard interval, $K(X, \mathcal{M})$ would be replaced by a standard Gaussian or Student quantile. Of course, the standard intervals do not account for multiplicity and do not have uniform coverage over $i \in M \in \mathcal{M}$ (see [1,2]). Hence $K(X, \mathcal{M})$ is the inflation factor, or correction over standard intervals, needed to get uniform coverage; it must go to infinity as $p \to \infty$ [6]. Studying the asymptotic order of magnitude of $K(X, \mathcal{M})$ is thus an important problem, as this order of magnitude corresponds to the price one has to pay in order to obtain universally valid post model selection inference. We now present the existing results on the asymptotic order of magnitude of $K(X, \mathcal{M})$. Let us define
$$\gamma_{\mathcal{M},r} := \max_{M \in \mathcal{M}}\ \max_{i \in M}\ \frac{|v_{M,i}^t\, \xi|}{\|v_{M,i}\|_2\, N},$$
so that $\gamma_{\mathcal{M},r} = \gamma_{\mathcal{M},\infty}/N$, where we recall that $rN^2$ follows a chi-square distribution with $r$ degrees of freedom. We can relate the quantiles of $\gamma_{\mathcal{M},r}$ (which coincide with the PoSI constants $K(X, \mathcal{M})$) to the expectation $\mathbb{E}[\gamma_{\mathcal{M},\infty}]$ by the following argument based on Gaussian concentration (see Appendix A).

Proposition 2.3. Let $T(\mu, r, \alpha)$ denote the $\alpha$-quantile of a noncentral T distribution with $r$ degrees of freedom and noncentrality parameter $\mu$. Then
$$K(X, \mathcal{M}, \alpha, r) \le T\big(\mathbb{E}[\gamma_{\mathcal{M},\infty}],\ r,\ 1 - \alpha/2\big).$$

To be more concrete, a rough estimate of the latter quantile can be obtained from standard approximations of noncentral T quantiles; furthermore, as $r \to +\infty$, this quantile reduces to the $(1 - \alpha/2)$ quantile of a Gaussian distribution with mean $\mathbb{E}[\gamma_{\mathcal{M},\infty}]$ and unit variance.
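For illustration, the bound of Proposition 2.3 is directly computable in R, since the quantile function qt handles noncentrality parameters; this is a sketch with names of our choosing, taking (an estimate or an upper bound of) $\mathbb{E}[\gamma_{\mathcal{M},\infty}]$ as input.

posi_quantile_bound <- function(Egamma, r = Inf, alpha = 0.05) {
  if (is.infinite(r)) {
    qnorm(1 - alpha / 2, mean = Egamma)      # Gaussian limiting case r = Inf
  } else {
    qt(1 - alpha / 2, df = r, ncp = Egamma)  # noncentral T quantile
  }
}
posi_quantile_bound(3, r = 20)   # bound on K for E[gamma] = 3, r = 20
posi_quantile_bound(3)           # known-variance case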
The point of the above estimate is that the dependence on the set of models $\mathcal{M}$ is only present through $\mathbb{E}[\gamma_{\mathcal{M},\infty}]$. Therefore, we will focus in this paper on the problem of bounding $\mathbb{E}[\gamma_{\mathcal{M},\infty}]$, which is nothing more than the Gaussian width [15, Chapter 9] of the symmetrized set of normalized vectors $\{\pm v_{M,i}/\|v_{M,i}\|_2 \,|\, M \in \mathcal{M},\ i \in M\}$. It is known from [6] that $\mathbb{E}[\gamma_{\mathcal{M}_{all},\infty}]$ is no smaller than $\sqrt{2\log(2p)}\,(1 + o(1))$ and asymptotically no larger than $\sqrt{p}$. These lower and upper bounds are attained by orthogonal design matrices and equi-correlated design matrices, respectively (see [6]). We now concentrate on $s$-sparse models. For $s \le p$, let us define $\mathcal{M}_s = \{M \,|\, M \subset \{1, \ldots, p\},\ |M| \le s\}$. In this case, using a direct argument based on cardinality, one gets the following generic upper bound (proved in Appendix B).

Lemma 2.4. For any $s, n, p \in \mathbb{N}$ with $s \le n$, we have
$$\mathbb{E}[\gamma_{\mathcal{M}_s,\infty}] \le \sqrt{2 s \log(6p/s)}. \qquad (8)$$
We remark that an asymptotic version of the bound in Lemma 2.4 (as $p$ and $s$ go to infinity) appears in an intermediate version of [32].
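As a numerical illustration (helper names ours), the bound of Lemma 2.4 can be compared with the orthogonal-case value $\sqrt{2\log(2p)}$:

U_orth   <- function(p) sqrt(2 * log(2 * p))
U_sparse <- function(p, s) sqrt(2 * s * log(6 * p / s))
p <- 1e4
rbind(s      = c(1, 5, 20),
      sparse = sapply(c(1, 5, 20), function(s) U_sparse(p, s)),
      orth   = rep(U_orth(p), 3))

For $s = 1$ the two values are already of the same order, while the cardinality-based bound grows like $\sqrt{s}$.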

Main result
We recall the definition and a property of the RIP constant $\kappa(X, s)$ associated to a design matrix $X$ and a sparsity level $s$, given in [15, Chap. 6]:
$$\kappa(X, s) := \max_{M \subset \{1, \ldots, p\},\ 1 \le |M| \le s} \big\|X_M^t X_M - I_{|M|}\big\|_{2 \to 2}.$$
Letting $\kappa = \kappa(X, s)$, we have, for any subset $M \subset \{1, \ldots, p\}$ such that $|M| \le s$ and any $u \in \mathbb{R}^{|M|}$,
$$(1 - \kappa)\,\|u\|_2^2 \le \|X_M u\|_2^2 \le (1 + \kappa)\,\|u\|_2^2. \qquad (10)$$
Remark 3.1. The RIP condition may also be stated between norms instead of squared norms in (10). Following [15, Chap. 6] we will consider the formulation in terms of squared norms, which is more convenient here.
Since the PoSI constant $K(X, \mathcal{M})$ only depends on $\mathrm{corr}(X^t X)$ (see Lemma 2.2), we shall rather consider the RIP constant associated to $\mathrm{corr}(X^t X)$. We let
$$\delta(X, s) := \max_{M \subset \{1, \ldots, p\},\ 1 \le |M| \le s} \big\|\big(\mathrm{corr}(X^t X)\big)_{M,M} - I_{|M|}\big\|_{2 \to 2}.$$
Any upper bound for $\kappa(X, s)$ yields an upper bound for $\delta(X, s)$, as shown in Lemma 3.2.
The next theorem is the main result of the paper. It provides a new upper bound on the PoSI constant, under RIP conditions and with sparse submodels. We remark that in this theorem, we do not necessarily assume that $n \ge p$.

Theorem 3.3. Let $s, n, p \in \mathbb{N}$ with $s \le n$, and let $\delta \ge \delta(X, s)$. Then
$$\mathbb{E}[\gamma_{\mathcal{M}_s,\infty}] \le \sqrt{2\log(2p)} + 2\,\delta\,\sqrt{2 s \log(6p/s)}.$$
This upper bound is of the form
$$U_{RIP}(p, s, \delta) = U_{orth}(p) + 2\,\delta\, U_{sparse}(p, s),$$
where:
• $U_{orth}(p) = \sqrt{2\log(2p)}$ is the upper bound in the orthogonal case;
• $U_{sparse}(p, s)$ is the right-hand side of (8), corresponding to the cardinality-based upper bound in the sparse case.
When $\delta\, U_{sparse}(p, s) = o(U_{orth}(p))$, $U_{RIP}$ is even asymptotically equivalent to $U_{orth}$. In particular, this is the case if $\delta\sqrt{s} \to 0$. We now consider the specific case where $X$ is a subgaussian random matrix, that is, $X$ has independent subgaussian entries [15, Definition 9.1]. We discuss in which situations $\delta = \delta(X, s) \to 0$. The estimate of $\kappa$ in [15, Theorem 9.2] combined with Lemma 3.2 yields that, with high probability,
$$\delta(X, s) \le C \sqrt{\frac{s \log(ep/s)}{n}} \qquad (12)$$
for a constant $C$, so that $\delta \to 0$ as soon as $n/(s \log(ep/s)) \to +\infty$.
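The interpolation just described can be illustrated numerically; the following R lines (a sketch, with helper names ours, based on the form of the bound in Theorem 3.3) evaluate $U_{RIP}$ for several values of $\delta$.

U_orth   <- function(p) sqrt(2 * log(2 * p))
U_sparse <- function(p, s) sqrt(2 * s * log(6 * p / s))
U_RIP    <- function(p, s, delta) U_orth(p) + 2 * delta * U_sparse(p, s)
p <- 1e4; s <- 10
c(U_RIP(p, s, 0),      # delta = 0: recovers U_orth(p)
  U_RIP(p, s, 0.05),   # intermediate regime
  U_RIP(p, s, 1))      # delta of order 1: order of U_sparse(p, s)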

Comparison with upper bounds based on Euclidean norms
We now compare our upper bound in Theorem 3.3 to upper bounds recently and independently obtained in [18]. Recall the notation $Y$, $\mu$, $\beta_M$ and $\hat\beta_M$ from Section 2, and let $r = \infty$ for simplicity of exposition. The authors in [18] address the case where $X$ is random (random design) and consider the target $\bar\beta_M$, the population version of the regression coefficients $\beta_M$, assuming that the rows of $X$ are independent random vectors in dimension $p$. They derive uniform bounds over $M \in \mathcal{M}_s$ for $\|\hat\beta_M - \bar\beta_M\|_2$. They also consider briefly (Remark 4.3 in [18]) the fixed design setting, as in the present paper; the fixed design target $\beta_M$ can be interpreted as the random design target conditionally on $X$. They assume that the individual coordinates of $X$ and $Y$ have exponential moments bounded by a constant independently of $n, p$ (thus their setting is more general than the Gaussian regression setting, but for the purpose of this discussion we assume Gaussian noise).
Let us additionally assume that the RIP property $\kappa(X/\sqrt{n}, s) \le \kappa$ is satisfied (on an event of probability tending to 1), with $\kappa$ in a compact subset of $[0, 1)$ independently of $n, p$; note that we used the rescaling of $X$ by $\sqrt{n}$, which is natural in the random design case. Then some simple estimates obtained as a consequence of Theorems 3.1 and 4.1 in [18] yield
$$\max_{M \in \mathcal{M}_s} \|\hat\beta_M - \bar\beta_M\|_2 = O_P\bigg(\sqrt{\frac{s \log(ep/s)}{n}}\bigg). \qquad (13)$$
Thus, if $\delta = \Omega(1)$, since the Euclidean norm upper bounds the supremum norm, the results of [18] imply ours (at least in the sense of these asymptotic considerations). On the other hand, in the case where $\delta \to 0$, which is the case we are specifically interested in, we obtain a sharper bound (in the weaker supremum norm).
In particular, if $X$ is a subgaussian random matrix (as discussed in the previous section), due to (12) we obtain
$$K(X, \mathcal{M}_s) = O_P\bigg(\sqrt{\log p} + \frac{s \log(ep/s)}{\sqrt{n}}\bigg) \qquad (14)$$
and hence, for the corresponding estimation error,
$$\max_{M \in \mathcal{M}_s} \|\hat\beta_M - \beta_M\|_\infty = O_P\bigg(\sqrt{\frac{\log p}{n}} + \frac{s \log(ep/s)}{n}\bigg). \qquad (15)$$
This improves over the estimate deduced from (13) as soon as $s \log(ep/s) = o(n)$, which corresponds to the case where (13) tends to 0. Conversely, in this situation our bound (15) yields for the Euclidean norm (using $\|w\|_2 \le \sqrt{\|w\|_0}\,\|w\|_\infty$):
$$\max_{M \in \mathcal{M}_s} \|\hat\beta_M - \beta_M\|_2 = O_P\bigg(\sqrt{\frac{s \log p}{n}} + \frac{s^{3/2} \log(ep/s)}{n}\bigg). \qquad (16)$$
Assuming $s = O(p^{\lambda})$ for some $\lambda < 1$ for ease of interpretation, we see that (16) is of the same order as (13) when $s^2 \log(p) = O(n)$, and is of a strictly larger order otherwise. In this sense, it seems that (15) and (13) are complementary to each other, since we are using a weaker norm but obtain a sharper bound in the case $\delta \to 0$.

Applicability
While the main interest of our results is theoretical, we now discuss the applicability of our bound. For any $\delta \ge \delta(X, s)$, Theorem 3.3 combined with Proposition 2.3 provides a bound of the form $U_{RIP}(p, s, \delta) \ge K(X, \mathcal{M}_s)$, with
$$U_{RIP}(p, s, \delta) = T\Big(\sqrt{2\log(2p)} + 2\,\delta\,\sqrt{2 s \log(6p/s)},\ r,\ 1 - \alpha/2\Big).$$
This bound can be used in practice in situations where $\delta(X, s)$ (or an upper bound of it) can be computed, whereas $K(X, \mathcal{M}_s)$ cannot, because the number of inner products in (5) is too large. Indeed, for a given $\delta$, it is immediate to compute $U_{RIP}(p, s, \delta)$.
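For instance, a few lines of R suffice (a sketch, names ours), using the noncentral T quantile function qt:

U_RIP_quantile <- function(p, s, delta, r = Inf, alpha = 0.05) {
  ncp <- sqrt(2 * log(2 * p)) + 2 * delta * sqrt(2 * s * log(6 * p / s))
  if (is.infinite(r)) qnorm(1 - alpha / 2, mean = ncp)
  else qt(1 - alpha / 2, df = r, ncp = ncp)
}
U_RIP_quantile(p = 1e4, s = 10, delta = 0.05, r = 50)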
Upper bounding the RIP constant When $n \ge p$, we have $\delta(X, s) \le \delta(X, p)$, and $\delta(X, p)$ can be computed in practice for a given $X$. Specifically, $\delta(X, p)$ is the largest eigenvalue of $\mathrm{corr}(X^t X) - I_p$ in absolute value. When $X$ is a subgaussian random matrix, $\delta(X, p)$ is of order $\sqrt{p/n}$ [3,24]. Thus, if $n$ is large enough compared to $p$, the computable upper bound $U_{RIP}(p, s, \delta(X, p))$ will improve on the sparsity-based upper bound $U_{sparse}(p, s) = T\big((2s\log(6p/s))^{1/2}, r, 1 - \alpha/2\big) \ge K(X, \mathcal{M}_s)$; see Proposition 2.3 and Lemma 2.4. On the other hand, when $n < p$, it is typically too costly to compute $\delta(X, s)$ (or an upper bound of it) for a large $p$. Nevertheless, if one knows that $X$ is a subgaussian random matrix, one can compute an upper bound $\bar\delta$ satisfying $\bar\delta \ge \delta(X, s)$ with high probability, as in [15, Chapter 9]. We remark that, using the values of $\bar\delta$ currently available in the literature, one would need $n$ to be very large for $U_{RIP}(p, s, \bar\delta)$ to improve on $U_{sparse}(p, s)$.
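In the case $n \ge p$, the computation of $\delta(X, p)$ described above takes a few lines of R (function name ours; cov2cor plays the role of corr here):

delta_full <- function(X) {
  C <- cov2cor(crossprod(X))                  # corr(X^t X)
  max(abs(eigen(C - diag(ncol(X)), symmetric = TRUE,
                only.values = TRUE)$values))  # largest |eigenvalue|
}
set.seed(1)
X <- matrix(rnorm(2000 * 50), 2000, 50)   # subgaussian design, n >= p
delta_full(X)                             # of order sqrt(p/n)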

Alternative upper bound on the PoSI constant For any $\delta \ge \delta(X, s)$, we now show how to compute an alternative bound of the form $\tilde{U}_{RIP}(p, s, \delta) \ge K(X, \mathcal{M}_s)$. Our numerical experiments suggest that this alternative bound is generally sharper than $U_{RIP}(p, s, \delta)$. For $q, r, \rho \in \mathbb{N}$ and $\epsilon \in (0, 1)$, let $B_{\epsilon}(q, r, \rho)$ be defined as the smallest $t > 0$ at which the tail probability of the distribution $H_{q,\rho}$ is no larger than $\epsilon$, where $G^2/q$ follows a Fisher distribution with $q$ and $r$ degrees of freedom, and $F_{\mathrm{Beta},a,b}$ denotes the cumulative distribution function of the Beta$(a, b)$ distribution. In the case $r = +\infty$, $B_{\epsilon}$ is also defined and further described in [2, Section 2.5.2]. It can be seen from the proof of Theorem 3.3 (see specifically (22), which also holds without the expectation operators), and from the arguments in [1], that for any $t \in (0, 1)$ we obtain an upper bound on $K(X, \mathcal{M}_s, \alpha, r)$ expressed in terms of $B_{\epsilon}$; this upper bound can be minimized with respect to $t$, yielding $\tilde{U}_{RIP}(p, s, \delta)$.
The quantity $B_{\epsilon}(q, r, \rho)$ can be easily approximated numerically, as it is simply a quantile of the distribution $H_{q,\rho}$, which only involves standard distributions. Algorithm E.3 in the supplementary material of [1] can be used to compute $B_{\epsilon}(q, r, \rho)$. An implementation of this algorithm in R [26] is available in Appendix C. Hence, the upper bound $\tilde{U}_{RIP}(p, s, \delta)$ can be computed for large values of $p$ for a given $\delta$.
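To indicate the kind of computation involved (we do not reproduce Algorithm E.3), the following R sketch inverts a tail function numerically; the tail used below is only a placeholder based on the Fisher distribution of $G^2/q$ mentioned above, not the actual tail of $H_{q,\rho}$.

smallest_t <- function(tail_fun, eps, upper = 1e3) {
  # smallest t > 0 with tail_fun(t) <= eps, by root finding
  uniroot(function(t) tail_fun(t) - eps, lower = 1e-8, upper = upper)$root
}
# placeholder tail: P(G >= t), with G^2/q following a Fisher(q, r) distribution
tail_G <- function(t, q = 3, r = 5) pf(t^2 / q, df1 = q, df2 = r, lower.tail = FALSE)
smallest_t(tail_G, eps = 0.05)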

Equi-correlated design matrices
The goal of this section is to find a matching lower bound for Theorem 3.3. For this we extend ideas of [6, Example 6.2] and, following that reference, we restrict our study to design matrices $X$ for which $n \ge p$. The lower bound is based on the $p \times p$ matrix $Z^{(c,k)} = (e^p_1, e^p_2, \ldots, e^p_{p-1}, x_k(c))$, where
$$x_k(c) := c \sum_{j=1}^{k} e^p_j + \sqrt{1 - k c^2}\; e^p_p,$$
where we assume $k < p$, and the constant $c$ satisfies $c^2 < 1/k$, so that $Z^{(c,k)}$ has full rank. By definition, the correlation between any of the first $k$ columns of $Z^{(c,k)}$ and the last one is $c$, and the first $p - 1$ columns of $Z^{(c,k)}$ are the first $p - 1$ canonical basis vectors. The case where $k = p - 1$ is studied in [6, Example 6.2]: Theorem 6.2 in [6] implies that the PoSI constant $K(X, \mathcal{M}_{all})$, where $X$ is an $n \times p$ matrix such that $X^t X = Z^{(c,p-1)t} Z^{(c,p-1)}$, grows at the rate $\sqrt{p}$ as $p \to \infty$, for a suitable choice of $c = c(p)$. Here we consider a general $k$, for which
$$X^t X = Z^{(c,k)t} Z^{(c,k)} = \begin{pmatrix} I_k & [0] & [c] \\ [0] & I_{p-1-k} & [0] \\ [c] & [0] & 1 \end{pmatrix}, \qquad (17)$$
where $[a]$ means that all the entries of the corresponding block are identical to $a$. We begin by studying the RIP coefficient $\delta(X, s)$ for design matrices $X$ yielding the Gram matrix (17). Since this Gram matrix has full rank $p$, there exists a design matrix satisfying this condition if and only if $n \ge p$.
Lemma 4.1. Let $X$ be an $n \times p$ matrix for which $X^t X$ is given by (17) with $kc^2 < 1$. Then $\delta(X, s) = c\,\sqrt{\min(s - 1, k)}$.
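For small $p$, the construction can be checked directly; the following R sketch (names ours) builds $Z^{(c,k)}$ and computes $\delta(\cdot, s)$ by enumerating all models of size at most $s$.

make_Z <- function(p, k, c) {
  stopifnot(k < p, k * c^2 < 1)
  Z <- diag(p)                      # columns e_1, ..., e_p
  Z[, p] <- c(rep(c, k), rep(0, p - 1 - k), sqrt(1 - k * c^2))  # x_k(c)
  Z
}
delta_sparse <- function(X, s) {
  C <- cov2cor(crossprod(X)); p <- ncol(X)
  models <- unlist(lapply(1:s, function(m) combn(p, m, simplify = FALSE)),
                   recursive = FALSE)
  max(sapply(models, function(M)
    max(abs(eigen(C[M, M, drop = FALSE] - diag(length(M)),
                  symmetric = TRUE, only.values = TRUE)$values))))
}
Z <- make_Z(p = 8, k = 4, c = 0.3)
delta_sparse(Z, s = 3)   # equals c * sqrt(min(s - 1, k)) = 0.3 * sqrt(2) here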

A matching lower bound
In the following proposition, we provide a lower bound on $K(X, \mathcal{M}_s)$ for matrices $X$ yielding the Gram matrix (17).

Proposition 4.2.
For any $s \le k < p$, $c^2 < 1/k$ and $\alpha \le \frac{1}{2}$, let $X$ be an $n \times p$ matrix for which $X^t X$ is given by (17). We have
$$K(X, \mathcal{M}_s) \ge A\, c\, s\, \sqrt{\log(k/s)},$$
where $A > 0$ is a universal constant.
From the previous proposition, we now show that the upper bound of Theorem 3.3 is optimal (up to a multiplicative constant) for a large range of behaviors of $s$ and $\delta$ relative to $p$. As discussed after Theorem 3.3, in the case where $\delta\sqrt{s}\,\sqrt{1 - \log s/\log p + 1/\log p} = O(1)$, the upper bound we obtain is optimal, since it can then be written as $O(\sqrt{\log p})$, which is the minimal possible order for a PoSI constant. In the next corollary, we show that the upper bound of Theorem 3.3 is also optimal when $\delta\sqrt{s}\,\sqrt{1 - \log s/\log p + 1/\log p}$ tends to $+\infty$, provided $\delta = O(p^{-\lambda})$ for some $\lambda > 0$.
Corollary 4.3. Let $(s_p)$ and $(\delta_p)$ be sequences such that $\delta_p\sqrt{s_p}\,\sqrt{1 - \log s_p/\log p + 1/\log p} \to +\infty$ as $p \to \infty$. Then, for any sequence of design matrices $X_p$ with $\delta(X_p, s_p) \le \delta_p$,
$$K(X_p, \mathcal{M}_{s_p}) \le B\, \delta_p\, \sqrt{s_p \log(6p/s_p)},$$
where $B$ is a constant. Moreover, there exists a sequence of design matrices $X_p$ such that $\delta(X_p, s_p) \le \delta_p$ and
$$K(X_p, \mathcal{M}_{s_p}) \ge A\, \delta_p\, \sqrt{s_p \log\Big(\min\Big(\frac{p-1}{s_p},\ \frac{1}{\delta_p^2}\Big)\Big)},$$
where $A$ is a constant. In particular, if $\delta_p = O(p^{-\lambda})$ for some $\lambda > 0$ and if $(p - 1)/s_p \ge 2$, then the above upper and lower bounds have the same rate.
Therefore, the upper bound in Theorem 3.3 is optimal in most configurations of $s_p$ and $\delta_p$, except when $\delta_p$ goes to 0 more slowly than any inverse power of $p$.

Concluding remarks
In this paper, we have proposed an upper bound on PoSI constants in $s$-sparse situations where the $n \times p$ design matrix $X$ satisfies a RIP condition. As the value of the RIP constant $\delta$ increases from 0, this upper bound interpolates between the case of an orthogonal $X$ and an existing upper bound based only on sparsity and cardinality. We have shown that our upper bound is asymptotically optimal for many configurations of $(s, \delta, p)$ by giving a matching lower bound. In the case of random design matrices with independent entries, since $\delta$ decreases with $n$, our upper bound compares increasingly favorably to the cardinality-based upper bound as $n$ gets larger. It is also complementary to the bounds recently proposed in [18]. The interest and various applications of the RIP property are well known in the high-dimensional statistics literature, in particular for statistical risk analysis and support recovery. Our analysis brings to light an additional interest of the RIP property, for agnostic post-selection inference (uncertainty quantification).
The PoSI constant corresponds to confidence intervals on $\beta_M$ in (2). In Section 3.2 we also mentioned another target of interest in the case of random $X$, namely $\bar\beta_M$. This quantity depends on the distribution of $X$ rather than on its realization, which is a desirable property, as discussed in [1,18], where the same target has also been considered. In [1], it is shown that valid confidence intervals for $\beta_M$ are also asymptotically valid for $\bar\beta_M$, provided that $p$ is fixed. These results require that $\mu$ belongs to the column space of $X$ and hold for models $M$ such that $\mu$ is close to the column space of $X_M$. It would be interesting to study whether assuming RIP conditions on $X$ enables one to alleviate these assumptions.
The purpose of post-selection inference based on the PoSI constant $K(X, \mathcal{M})$ is to achieve the coverage guarantee (6). The guarantee (6) implies that, for any model selection procedure $\hat{M}: \mathbb{R}^n \to \mathcal{M}$, with probability larger than $1 - \alpha$, for all $i \in \hat{M}$, $(\beta_{\hat{M}})_{i.\hat{M}} \in CI_{i,\hat{M}}$. Hence, there is in general no need to make assumptions about the model selection procedure when using PoSI constants. On the other hand, the RIP condition that we study here is naturally associated with specific model selection procedures, namely the lasso or the Dantzig selector [9,10,30,33]. Hence, it is natural to ask whether the results in this paper could help post-selection inference specifically for such procedures. We believe that the answer could be positive in some situations. Indeed, if the lasso model selector is used in conjunction with a design matrix $X$ satisfying a RIP property, then asymptotic guarantees exist on the sparsity of the selected model [8]. Thus, one could investigate the combination of bounds on the size of selected models (of the form $|\hat{M}| \le S$, holding with high probability) with our upper bound, by replacing $s$ by $S$.
In the case of the lasso model selector, we referred in the introduction to the post-selection intervals of [19], which achieve conditional coverage specifically for the lasso model selector. These intervals are simple to compute (when the conditioning is on the signs; see [19]). Generally speaking, in comparison with confidence intervals based on PoSI constants, the confidence intervals of [19] have the benefit of guaranteeing a coverage level conditionally on the selected model. On the other hand, the confidence intervals of [19] can be large, and can have coverage below the nominal level when the regularization parameter of the lasso is data-dependent [1]. It would be interesting to study whether these general conclusions would be modified in the special case of design matrices satisfying RIP properties.
Finally, the focus of this paper is on PoSI constants in the context of linear regression. Recently, [2] extended the PoSI approach to more general settings (for instance generalized linear models), provided a joint asymptotic normality property holds between model-dependent targets and estimators. This extension was developed in an asymptotic regime with fixed dimension and a fixed number of models. In the high-dimensional case, an interesting direction would be to apply the results of [12], which provide Gaussian approximations for maxima of sums of high-dimensional random vectors. This opens the perspective of applying our results to various high-dimensional post model selection settings, beyond linear regression.
It is known that in certain situations one can expect an even tighter concentration, through the phenomenon known as superconcentration [11]. While such situations are likely to be relevant for the setting considered in this paper, we leave such improvements as an open issue for future work.
We use the previous property in our setting as follows.

Proposition A.2. Let $C$ be a finite family of unit vectors of $\mathbb{R}^n$, $\xi$ a standard Gaussian vector in $\mathbb{R}^n$, and $N$ an independent nonnegative random variable such that $rN^2$ follows a chi-squared distribution with $r$ degrees of freedom. Define the random variable
$$\gamma_{C,r} := \max_{v \in C} \frac{|v^t \xi|}{N}.$$
Then the $(1 - \alpha)$ quantile of $\gamma_{C,r}$ is upper bounded by the $(1 - \alpha/2)$ quantile of a noncentral T distribution with $r$ degrees of freedom and noncentrality parameter $\mathbb{E}[\max_{v \in C} |v^t \xi|]$.
Proof. Observe that $\xi \mapsto \max_{v \in C} |v^t \xi|$ is 1-Lipschitz since the vectors of $C$ are unit vectors. Therefore we conclude by Theorem A.1 that there exists a standard normal variable $\zeta$ (which is independent of $N$, since $N$ is independent of $\xi$) so that the following holds:
$$\gamma_{C,r} \le \frac{\mathbb{E}[\max_{v \in C} |v^t \xi|] + |\zeta|}{N}.$$
We can represent the above right-hand side as $\max(T_+, T_-)$, where
$$T_{\pm} := \frac{\mathbb{E}[\max_{v \in C} |v^t \xi|] \pm \zeta}{N},$$
i.e., $T_+$ and $T_-$ are two (dependent) random variables, each with a noncentral T distribution with $r$ degrees of freedom and noncentrality parameter $\mathbb{E}[\max_{v \in C} |v^t \xi|]$. Finally, since
$$\mathbb{P}\big(\max(T_+, T_-) \ge t\big) \le \mathbb{P}(T_+ \ge t) + \mathbb{P}(T_- \ge t) = 2\,\mathbb{P}(T_+ \ge t),$$
we obtain the claim.
Since a noncentral T distribution is stochastically increasing in its noncentrality parameter, any bound obtained for $\mathbb{E}[\max_{v \in C} |v^t \xi|]$ results in a corresponding bound on the quantiles of the corresponding noncentral T distribution, and therefore on those of $\gamma_{C,r}$. In the limit $r \to \infty$, the quantiles of the noncentral T distribution reduce to those of a shifted Gaussian distribution with unit variance.
The claimed estimate follows.
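As a sanity check of Proposition A.2, the empirical quantile of $\gamma_{C,r}$ can be compared with the noncentral T bound by simulation (a sketch, names ours):

set.seed(1)
n <- 30; r <- 10; alpha <- 0.05
C <- qr.Q(qr(matrix(rnorm(n * 5), n, 5)))   # a family of 5 orthonormal unit vectors
sims <- replicate(2e4, {
  xi <- rnorm(n)
  N <- sqrt(rchisq(1, df = r) / r)          # r * N^2 ~ chi-square(r)
  max(abs(crossprod(C, xi))) / N            # a draw of gamma_{C,r}
})
mu <- mean(replicate(2e4, max(abs(crossprod(C, rnorm(n))))))  # E[max_v |v^t xi|]
c(empirical = unname(quantile(sims, 1 - alpha)),
  bound     = qt(1 - alpha / 2, df = r, ncp = mu))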