Optimal Two-Step Prediction in Regression

High-dimensional prediction typically comprises two steps: variable selection and subsequent least-squares refitting on the selected variables. However, the standard variable selection procedures, such as the lasso, hinge on tuning parameters that need to be calibrated. Cross-validation, the most popular calibration scheme, is computationally costly and lacks finite sample guarantees. In this paper, we introduce an alternative scheme, easy to implement and both computationally and theoretically efficient.

1. Introduction. Prediction in high-dimensional regression often consists of two steps: variable selection to obtain a low-dimensional model, followed by least-squares estimation to refine the model. Standard variable selection procedures such as the lasso [22], the square-root lasso or scaled lasso [1,3,19,20], thresholded ridge regression [18], and the thresholded lasso [27] hinge on tuning parameters; therefore, careful tuning parameter calibration is essential for good prediction. In particular, the least-squares refitting needs to be taken into account to obtain optimal prediction performance.
Tuning parameter selection has been extensively studied, but for high-dimensional prediction, no computationally feasible scheme with theoretical guarantees is available. The most popular scheme is cross-validation, which is computationally expensive and difficult to study theoretically. Faster and theoretically more tractable schemes are unbiased risk estimation methods, such as Mallows' Cp, Akaike's Information Criterion, or Stein's Unbiased Risk Estimation criterion [29]; however, these schemes require differentiability with respect to the data, and least-squares refitting typically renders estimators non-differentiable. In addition, these schemes have a limited scope, because they usually require strong assumptions on the distribution of the noise, such as having a normal or elliptical distribution. Schemes that can provide fast and optimal two-step calibration for a variety of variable selection procedures and settings are therefore of great interest.
In this paper, we introduce a novel scheme, Adaptive Validation for Prediction (AVp), that provides fast and optimal calibration. Our proposal is related to the recently introduced ℓ∞-Adaptive Validation (AV∞) scheme [6], which is based on tests inspired by isotropic versions of Lepski's method [5,13,14]. AV∞ has been shown to provide fast and optimal calibration of the lasso for (one-step) estimation with respect to ℓ∞-loss.
Unfortunately, AV∞ is not suited to the present task, which involves different procedures and a different loss; therefore, despite relying on related testing strategies, AVp and AV∞ stand on different bases. For example, AVp exploits novel prediction bounds for the lasso and ridge regression, while AV∞ is based on known estimation bounds for the lasso. Moreover, AVp involves sets and, therefore, requires an anisotropic framework instead of the simpler isotropic (one-dimensional) framework underlying AV∞. Nevertheless, we can show that AVp is, for our task, just as AV∞ is for lasso estimation: a fast and optimal calibration scheme that rivals standard schemes such as cross-validation in terms of speed and accuracy.
This article is organized as follows. Section 2: We present AVp, our novel calibration scheme for fast and optimal two-step prediction, and describe its optimality properties. Section 3: We introduce guarantees for optimal calibration of a variety of variable selection procedures, with particular emphasis on the lasso and thresholded ridge regression. Section 4: We summarize the results of a thorough simulation study that illustrates the good performance of AVp for a variety of estimators and settings. To ease the presentation, the proofs of the different propositions have been collected in Appendix A rather than in the main text. We start by describing the framework and the notation. We consider the linear regression model

(1.1) Y = Xβ + ε,

where Y ∈ ℝⁿ is the data, X ∈ ℝ^{n×p} the design matrix, β ∈ ℝ^p the regression vector, and ε ∈ ℝⁿ the random noise. We assume that the columns of the design matrix X₁, …, X_p ∈ ℝⁿ have been standardized to have Euclidean norm ‖X_j‖₂ = √n, but otherwise allow for arbitrary correlations between the columns and arbitrary noise distributions. We are mainly motivated by (but not limited to) high-dimensional settings with sparse regression vectors, where the number of parameters p can rival or even exceed the number of samples n and the active set S := supp(β) = {j : β_j ≠ 0} is small.
The task is to minimize the prediction loss ‖Xβ̂ − Xβ‖₂². To achieve this, we first reduce the dimensionality with a family of variable selection procedures parametrized by a tuning parameter, which yields a path of r data-dependent active sets Ŝ_i ⊆ {1, …, p}, i ∈ [r]. Possibilities for variable selection methods include the lasso, thresholded ridge regression, and many others. As there are only 2^p possible subsets, the number of sets under consideration r is always finite. We then perform a least-squares refitting on the active sets Ŝ_i, yielding the final two-step estimates

(1.2) β̂_i ∈ arg min{‖Y − Xβ‖₂² : β ∈ ℝ^p, supp(β) ⊆ Ŝ_i}, that is, (β̂_i)_{Ŝ_i} = (X_{Ŝ_i})⁺Y and (β̂_i)_j = 0 for j ∉ Ŝ_i,

where ⁺ denotes a pseudo-inverse, and X_J denotes for any subset J ⊆ [p] the matrix formed by the columns X_j with j ∈ J. Our proposed scheme AVp aims at minimizing the prediction loss ‖Xβ̂_i − Xβ‖₂². Finally, for any I ⊆ [p], we write P_I for the projection matrix X_I(X_I^⊤X_I)⁺X_I^⊤, while for i ∈ [p], the Kronecker vector δ_i stands for (δ_i)_j = 1l[i = j].
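The refitting step (1.2) and the projection P_I can be computed directly with a pseudo-inverse. The following numpy sketch (function names are ours, not the paper's) illustrates this:

```python
import numpy as np

def refit_least_squares(X, Y, S):
    """Least-squares refit on the active set S, in the style of (1.2);
    the pseudo-inverse handles a possibly rank-deficient X_S."""
    beta = np.zeros(X.shape[1])
    if len(S) > 0:
        idx = sorted(S)
        beta[idx] = np.linalg.pinv(X[:, idx]) @ Y
    return beta

def projection(X, S):
    """Projection P_S = X_S (X_S^T X_S)^+ X_S^T onto the span of the columns in S."""
    XS = X[:, sorted(S)]
    return XS @ np.linalg.pinv(XS.T @ XS) @ XS.T
```

By construction, the fitted values X β̂ of the refit equal P_S Y, which is the identity used repeatedly in Section 2.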
2. Optimal Two-Stage Prediction. We first introduce the notion of optimality and then introduce Adaptive Validation for prediction (AV p ), our novel calibration scheme. We finally show that AV p is indeed fast and theoretically optimal.
2.1. The Oracle Set. We order the path by cardinality, |Ŝ₁| ≤ |Ŝ₂| ≤ · · · ≤ |Ŝ_r|, and define the oracle index and the oracle set as

v := min{i ∈ [r] : S ⊆ Ŝ_i} and S̄ := Ŝ_v.

In other words, the oracle set contains the active set S and has minimal cardinality among all such sets. The oracle set can therefore be viewed as the best possible approximation of the true set S within the path, see Figure 2.1 for an illustration. In particular, the oracle is the true active set S when the latter is contained in the path. (For ease of presentation, we assume that the oracle set exists; however, the following can be easily generalized.) Now in practice, the oracle depends on the truth and, therefore, is unavailable. We therefore desire a scheme that selects a set with the same performance for two-step prediction. To specify this, we define for the given ordered path |Ŝ₁| ≤ |Ŝ₂| ≤ · · · ≤ |Ŝ_r| the unions as Ŝ_{i,j} := Ŝ_i ∪ Ŝ_j and the associated two-stage estimators as

(2.1) β̂_{i,j} ∈ arg min{‖Y − Xβ‖₂² : β ∈ ℝ^p, supp(β) ⊆ Ŝ_{i,j}}.

In the special case i = j, we have Ŝ_{i,j} = Ŝ_i, and β̂_i := β̂_{i,i} corresponds to the definition (1.2). By (1.1), any of these two-stage estimators fulfills

(2.2) ‖Xβ̂_{i,j} − Xβ‖₂² = ‖P_{Ŝ_{i,j}}ε‖₂² + ‖(I_n − P_{Ŝ_{i,j}})Xβ‖₂².

For sufficiently large sample sizes n, variable selection methods usually provide stable (but not necessarily correct) sets of variables and, therefore, ‖P_{Ŝ_{i,j}}ε‖₂² can often be sharply bounded with a χ²_{|Ŝ_{i,j}|} tail bound multiplied with some factor c > 0 (see Section 3). The above display then yields the sharp bound

‖Xβ̂_{i,j} − Xβ‖₂² ≤ c|Ŝ_{i,j}| + ‖(I_n − P_{Ŝ_{i,j}})Xβ‖₂²

with high probability. This bound motivates the following assumption.
Assumption 2.1 (Prediction bounds). There are positive constants b, c > 0 such that with probability at least 1 − b, the least-squares estimators β̂_{i,j} simultaneously for all i, j ∈ [r] satisfy the bounds

(2.3) ‖Xβ̂_{i,j} − Xβ‖₂² ≤ c|Ŝ_{i,j}| + ‖(I_n − P_{Ŝ_{i,j}})Xβ‖₂².

Note that these bounds do not invoke compatibility constants or similar restrictions on the design [25], which reflects that correlations between columns are not necessarily obstructive for prediction in high-dimensional linear models [7,10,26]. For the oracle estimator β̄, which is the least-squares estimator on the oracle set S̄, the above bound reads as

‖Xβ̄ − Xβ‖₂² ≤ c|S̄|,

since S ⊆ S̄ and, therefore, (I_n − P_{S̄})Xβ = 0. In light of this bound, a set Ŝ_i (and the corresponding least-squares estimator β̂_i) will be called optimal if

(2.4) ‖Xβ̂_i − Xβ‖₂² ≤ c̄|S̄|

with probability 1 − b, for some constant c̄ > 0.
Our goal is to select an optimal set via the bound (2.3). This approach can be considered a refined version of the well-known AIC/BIC-type criteria. To be precise, if we invoke the model (1.1) and the definition of the estimators (2.1), and use Hölder's inequality to bound the stochastic part of the prediction errors, we see that

‖Xβ̂_{i,j} − Xβ‖₂² = ‖Y − Xβ̂_{i,j}‖₂² − ‖ε‖₂² + 2ε^⊤(Xβ̂_{i,j} − Xβ) ≤ ‖Y − Xβ̂_{i,j}‖₂² − ‖ε‖₂² + 2‖X^⊤ε‖_∞‖β̂_{i,j} − β‖₁.

This bound for the prediction errors can be suboptimal when compared with the previous decomposition, but it is close to the risk estimate behind AIC and BIC. Indeed, under assumptions on the sample size, the variable selection procedure, and the noise distribution, this is with high probability the same as the bound

‖Y − Xβ̂_{i,j}‖₂² + c′|Ŝ_{i,j}| − ‖ε‖₂²

for some positive constant c′. The last term is independent of the estimators, so minimizing this bound over the path is precisely the AIC/BIC criterion. Note, however, that the last term is often larger than n, in which case the bound is considerably too loose; to obtain sharper guarantees, we therefore work with the bound (2.3) directly.

2.2. Fast and Optimal Prediction With AVp. Having stated the notion of optimality, we now introduce Adaptive Validation for prediction (AVp) and show that it provides fast and optimal selection among the estimators β̂₁, …, β̂_r. The AVp scheme is defined as follows.
Definition 2.2 (AVp). Let a ≥ 0. The AVp estimator is β̃ := β̂_î with active set S̃ := Ŝ_î, where

î := min{i ∈ [r] : ‖Xβ̂_{i,j} − Xβ̂_i‖₂² ≤ a(|Ŝ_i| + |Ŝ_{i,j}|) for all j ∈ [r] with |Ŝ_j| ≥ |Ŝ_i|}

when the minimum exists, and î = r otherwise.
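As a concrete illustration, here is a minimal numpy sketch of the selection rule. The exact form of the test statistic and threshold (comparing the refit on Ŝ_i with the refit on the union Ŝ_i ∪ Ŝ_j) is our reading of Definition 2.2 and should be treated as an assumption, as are the function names:

```python
import numpy as np

def avp_select(X, Y, supports, a):
    """Sketch of the AVp selection rule: return the first index i (supports
    ordered by nondecreasing cardinality) whose refit passes all pairwise
    tests against larger sets; fall back to the last index otherwise."""
    def refit(S):
        beta = np.zeros(X.shape[1])
        idx = sorted(S)
        if idx:
            beta[idx] = np.linalg.pinv(X[:, idx]) @ Y
        return beta

    r = len(supports)
    for i in range(r):
        fit_i = X @ refit(supports[i])
        passed = True
        for j in range(r):
            if len(supports[j]) < len(supports[i]):
                continue
            union = set(supports[i]) | set(supports[j])
            fit_union = X @ refit(union)
            # test: refits on S_i and on S_i | S_j must be close
            if np.sum((fit_union - fit_i) ** 2) > a * (len(supports[i]) + len(union)):
                passed = False
                break
        if passed:
            return i
    return r - 1
```

The outer loop can stop at the first passing index, which is the early-stopping behavior exploited in the computational comparison of Section 4.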
This definition, illustrated in Figure 2.2, is based on tests that are inspired by anisotropic versions of Lepski's method [5, Section 4.2]. We note at this point that the constant a is determined by the underlying theoretical bounds (in particular, it does not need to be calibrated); see Section 4 for details.
Fig 2.2. Illustration of the tests in Definition 2.2: each test compares the refitted estimators on Ŝ_i and on Ŝ_{i,j} = Ŝ_i ∪ Ŝ_j for a given j such that |Ŝ_j| ≥ |Ŝ_i|. The sets are given by a lasso path, and the indexing corresponds to small sets on the left and large sets on the right. AVp selects the first index i, starting from the left, such that all the tests are passed.

Remark 2.1 (Comparison to AV∞). Definition 2.2 resembles the definition of AV∞ introduced in [6]. However, AV∞ and AVp have completely different objectives: While AV∞ aims at optimal sup-norm estimation with the lasso, AVp aims at optimal two-step prediction with a general class of variable selection procedures. These two objectives are very different and, therefore, approached very differently.
First, the tests in AV∞ concern tuning parameters, whereas the tests in AVp concern active sets. This difference is profound: Tuning parameters have a natural order on a one-dimensional line, whereas sets are multi-dimensional. This multi-dimensionality cannot be captured by tests along the lines of [6] but, instead, requires more involved tests in the spirit of anisotropic versions of Lepski's method. Second, AV∞ is based on sup-norm bounds for the lasso, whereas AVp is based on prediction bounds for a general class of estimators. Such prediction bounds are not available in the literature and, therefore, are the subject of Section 3.
We turn to the main result of this section. The following theorem demonstrates that AV p indeed provides optimal selection in the sense of Inequality (2.4).
Theorem 2.1 (Optimality of AVp). Let a ≥ 2c and Assumption 2.1 be satisfied. Then, with probability at least 1 − b, it holds that |S̃| ≤ |S̄| and

‖Xβ̃ − Xβ‖₂² ≤ (6a + 4c)|S̄|.

In view of Inequality (2.4), this result ensures that the estimator β̃ performs as well as the oracle estimator β̄, up to small constants. The proof of Theorem 2.1 is surprisingly basic and, therefore, very instructive.
Proof of Theorem 2.1. The proofs of the claims rely on the bounds (2.3), which hold with probability at least 1 − b. It is therefore convenient to denote by Ω the event on which these bounds hold and then work on Ω.
Claim (i): On Ω, it holds that î ≤ v. We prove this claim by contradiction and assume that î > v. Then, by the definition of our estimator, there must be an integer k ∈ [r] such that |Ŝ_k| ≥ |S̄| and

(2.5) ‖Xβ̂_{v,k} − Xβ̂_v‖₂² > a(|Ŝ_v| + |Ŝ_{v,k}|).

The fact that |Ŝ_k| ≥ |S̄| ≥ |S| and S ⊆ S̄ = Ŝ_v ⊆ Ŝ_{v,k}, together with the bound (2.3), yields

‖Xβ̂_{v,k} − Xβ̂_v‖₂² ≤ 2‖Xβ̂_{v,k} − Xβ‖₂² + 2‖Xβ̂_v − Xβ‖₂² ≤ 2c(|Ŝ_{v,k}| + |Ŝ_v|).

Since a ≥ 2c, this contradicts (2.5) and, therefore, concludes the proof of Claim (i).

Claim (ii): On Ω, it holds that ‖Xβ̃ − Xβ‖₂² ≤ (6a + 4c)|S̄|. To prove this claim, we note that by Claim (i) we have î ≤ v and therefore |S̃| ≤ |Ŝ_v| = |S̄|, which implies by the definition of the estimator that

‖Xβ̃ − Xβ̂_{î,v}‖₂² ≤ a(|Ŝ_î| + |Ŝ_{î,v}|) ≤ 3a|S̄|.

The bound (2.3), on the other hand, yields

‖Xβ̂_{î,v} − Xβ‖₂² ≤ c|Ŝ_{î,v}| ≤ 2c|S̄|,

since S ⊆ S̄ ⊆ Ŝ_{î,v}. Combining these two inequalities, we finally obtain

‖Xβ̃ − Xβ‖₂² ≤ 2‖Xβ̃ − Xβ̂_{î,v}‖₂² + 2‖Xβ̂_{î,v} − Xβ‖₂² ≤ 6a|S̄| + 4c|S̄| = (6a + 4c)|S̄|,

which concludes the proof of Claim (ii).
We finally show that AVp can be efficiently computed. Algorithm 1 provides an efficient implementation as it only requires the computation of at most (r − 1)(r − 2) associated least-squares estimators. Usually, it requires even much less: When there are no cardinality ties among active sets along the path, as is often the case, it never requires the computation of more than (r − 1)r/2 least-squares estimators. In strong contrast, prediction with k-fold cross-validation requires the additional computation of k paths of active sets, rather than the single path needed for AVp (see Table 1 below). Computation of paths with a variable selection procedure such as the lasso is usually much more expensive than the computation of least-squares estimators; this explains why in practice (see Section 4), prediction with AVp can be about k times faster than prediction with k-fold cross-validation.

Table 1: Prediction with k-fold cross-validation is costly as it requires the computation of k initial sets.

3. Application. We now develop theoretical results that explain why the bounds of Assumption 2.1, and therefore the optimality of the AVp scheme, should hold for a variety of first-stage estimators. The main result of this section is Theorem 3.2, which establishes the bounds for tuning-parametrized first-step estimators with a symmetry property and robust paths. In particular, we cover in depth three popular variable selection procedures, the lasso, thresholded ridge regression, and the square-root lasso, in Subsections 3.1, 3.2, and 3.3. In what follows, it will be assumed for simplicity that the regression noise follows a normal distribution, that is, ε ∼ N(0, σ²I_n) in the notation of (1.1).
Algorithm 1: AV p as in Definition 2.2.
Let B_Λ := {β̂_λ : λ ∈ Λ} be a family of estimators indexed by a set Λ of positive tuning parameters. As the maximal number of active sets is 2^p, we can restrict Λ to have finite cardinality without loss of generality. We call the family scale-symmetric if supp[β̂_λ(Y)] = supp[β̂₁(Y/λ)] for all λ ∈ Λ, Y ∈ ℝⁿ, and X ∈ ℝ^{n×p}, and we denote by W_λ(η) := {y ∈ ℝⁿ : sgn[β̂_λ(y)] = η} the sets of outputs that lead to the same sign vector η. Recall that in convex geometry, a polyhedron is a finite intersection of closed half-spaces; more details can be found at the start of Appendix A. The following assumption will prove convenient.

Assumption 3.1. The family of estimators B_Λ is such that: i) it is scale-symmetric; ii) for all λ ∈ Λ and η ∈ {1, 0, −1}^p, the closures of its regions of equal sign vector cl[W_λ(η)] are polyhedra.
We will later prove in Propositions 3.1 and 3.2 that these properties are met by the lasso and thresholded ridge regression. It turns out that all estimators that have the above properties also have robust tuning parameter paths for almost all targets. To be precise, define

D(Xβ) := inf{‖y − Xβ‖_∞ : y ∈ ℝⁿ such that supp[β̂_λ(y)] ≠ supp[β̂_λ(Xβ)] for some λ ∈ Λ},

which quantifies how robust the tuning parameter paths are with respect to the noise added to the target Xβ. We remark that there is nothing special about the ℓ∞-norm in the definition of D beyond mathematical convenience and that generalizations to other norms could be considered. We have the following result, whose proof is relegated to Appendix A.

Theorem 3.1. Let B_Λ be a family of estimators fulfilling Assumption 3.1. Then D(Xβ) > 0 for almost every target Xβ ∈ ℝⁿ.

This means that all families of estimators that fulfill i) and ii) have paths that are robust with respect to the noise added to the target Xβ.
For fixed n, Theorem 3.1 ensures the robustness of the estimators under consideration. To obtain finite sample bounds, however, we need to quantify its result in terms of the sample size n. The following condition is sufficient for our goals.
Assumption 3.2. The family of estimators B_Λ, the design matrix X, and the target Xβ are such that, for a sufficiently large constant c̄, there is an integer n₀ such that for all n ≥ n₀,

D(Xβ) ≥ c̄ log n.
In contrast with Assumption 3.1, Assumption 3.2 imposes restrictions on the growth of the design matrix X and the target Xβ as n → ∞, which appear very plausible in view of Theorem 3.1 and the results of [24]. We expect that further analysis could transform Assumption 3.2 into explicit restrictions on the growth of the design matrix X and the regression vector β for specific estimators; however, such an analysis is beyond the scope of this paper. Now recall the notation in Section 2: in particular, we denoted by r the cardinality of the set of tuning parameters Λ and ordered (by increasing size) the active sets supp[β̂_λ] as Ŝ₁, …, Ŝ_r. We can then state a bound for all estimators that fulfill the above assumptions, or, to be precise, Assumption 3.1 i) and Assumption 3.2.
Theorem 3.2. Let B_Λ be a family of estimators fulfilling Assumption 3.1 i) and Assumption 3.2. Then, for any t > 6, there exists a deterministic integer n₀ such that for all n ≥ n₀, with probability at least 1 − 1/t − 2t log(r)/r, the bounds of Assumption 2.1 hold with a constant c of the order σ²t log r.

This result provides the desired bound of Assumption 2.1. We remind the reader that we will show in Propositions 3.1 and 3.2 that the lasso and thresholded ridge regression satisfy at least Assumption 3.1 i) without conditions on the design or the target. Recall that r denotes the number of sets under consideration; for example, r can be the number of sets taken along the paths of the lasso or thresholded ridge regression. Note that in practice, r is typically much smaller than the total number of different sets along the paths, because the number of different sets along the paths can be of the order p or even larger [15].
3.1. The Lasso. We now prove that the lasso satisfies Assumption 3.1. To see this, we recall that for a given tuning parameter λ > 0, the lasso is any solution to

(3.2) β̂_λ ∈ arg min_{β ∈ ℝ^p} {‖Y − Xβ‖₂² + 2λ‖β‖₁}.

The corresponding two-step procedure is the lasso with least-squares refitting (lslasso),

β̄_λ ∈ arg min{‖Y − Xβ‖₂² : β ∈ ℝ^p, supp(β) ⊆ supp(β̂_λ)}.

This two-step procedure is very popular as it has smaller bias than the lasso for a range of models [2,12].
A major challenge is the calibration of the tuning parameter for the lslasso: not only is this task challenging for the lasso in itself but, to make things more complicated, tuning parameters that are suitable for the plain lasso might be completely unsuitable for the lslasso. There is a special case, however, where optimal tuning parameters are known and coincide, namely when p = n, X = √n I_n, and ε ∼ N(0, σ²I_n). In this case, the lasso is the soft-thresholding estimator

(β̂_λ)_j = sign(Y_j/√n)(|Y_j/√n| − λ/n)₊,

and the lslasso is the hard-thresholding estimator

(β̄_λ)_j = (Y_j/√n) 1l[|Y_j/√n| > λ/n].

In that context, it is known that λ = √(2σ²n log p) is essentially an optimal tuning parameter for both soft- and hard-thresholding [8]. The same also holds in the slightly more general orthogonal design, where n = p and X^⊤X = n I_n. However, this analysis only holds for those very specific design matrices; in realistic scenarios, the tuning parameters for the lasso and lslasso will not coincide and can be considerably different from √(2σ²n log p).
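For intuition, the orthogonal-design case can be reproduced in a few lines of numpy. The threshold below is expressed on the coefficient scale (λ/n under the normalization assumed here); this is an illustration of ours, not the paper's code:

```python
import numpy as np

def soft_threshold(z, t):
    # lasso under X = sqrt(n) * I_n: componentwise soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def hard_threshold(z, t):
    # lslasso under X = sqrt(n) * I_n: keep-or-kill (refitting removes the shrinkage)
    return z * (np.abs(z) > t)

n = 1000                                    # here p = n
rng = np.random.default_rng(1)
beta = np.zeros(n)
beta[:5] = 2.0
sigma = 1.0
Y = np.sqrt(n) * beta + rng.normal(scale=sigma, size=n)
z = Y / np.sqrt(n)                          # least-squares estimate
t = np.sqrt(2 * sigma**2 * np.log(n) / n)   # threshold on the coefficient scale
lasso_est = soft_threshold(z, t)
lslasso_est = hard_threshold(z, t)
```

The soft-thresholded estimate is shrunk by t on every selected coordinate, while the hard-thresholded (refitted) estimate is unbiased on the selected coordinates; this is the smaller-bias effect of refitting mentioned above.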
A subtlety in showing that the lasso satisfies Assumption 3.1 is that the solution to the problem (3.2) is not necessarily unique. It is therefore of interest to define the lasso "equicorrelation" set [24]

E[λ; Y] := {j ∈ [p] : |X_j^⊤(Y − Xβ̂_λ)| = λ}.

This set is unique and contains the support of any lasso solution β̂_λ. More importantly, there is always a lasso solution β̂_λ with support equal to the equicorrelation set: such a solution can be obtained with the modified LARS algorithm [23,24]. In light of Theorem 3.2, this means that AVp will provide optimal tuning parameter calibration for the lslasso as long as its design matrix and target grow according to Assumption 3.2.
3.2. Thresholded Ridge Regression. Besides the lasso, ridge regression [11] is popular for high-dimensional regression since it is amenable to fast and simple implementations. Combined with a subsequent thresholding, this method is particularly useful for variable selection. In the following, we show that this thresholded ridge regression (thrr) procedure satisfies Assumption 3.1. Formally, this procedure is defined componentwise as

(β̂_{λ,γ})_j := ((X^⊤X + γI)⁻¹X^⊤Y)_j 1l[|((X^⊤X + γI)⁻¹X^⊤Y)_j| > λ]

and yields the active set

Ŝ_{λ,γ} := {j ∈ [p] : |((X^⊤X + γI)⁻¹X^⊤Y)_j| > λ}.

In contrast with the lasso procedure, this estimator has two tuning parameters, the ridge parameter γ > 0 and the thresholding parameter λ > 0. For the tuning parameters γ and λ, [18] suggest cross-validation; however, this is computationally costly, especially as two tuning parameters are involved, and this method does not provide any theoretical guarantees. Now consider the least-squares refitted version of thresholded ridge regression (lsthrr), which is defined as

β̄_{λ,γ} ∈ arg min{‖Y − Xβ‖₂² : β ∈ ℝ^p, supp(β) ⊆ Ŝ_{λ,γ}}.

In the simulations, we show that AVp can be used to calibrate both tuning parameters λ and γ, see Section 4. In this theoretical part, however, we focus on the calibration of λ and assume γ fixed. In particular, we show that Assumption 3.1 is met for thresholded ridge regression with fixed γ. In light of Theorem 3.2, this means that AVp will provide optimal tuning parameter calibration in λ for least-squares refitted thresholded ridge regression as long as its design matrix and target grow according to Assumption 3.2.
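As a sketch (in numpy, with hypothetical function names), thrr and its refitted version lsthrr can be implemented as follows:

```python
import numpy as np

def thrr_support(X, Y, lam, gamma):
    """Active set of thresholded ridge regression: coordinates of the
    ridge estimator (X'X + gamma I)^{-1} X'Y exceeding lam in magnitude."""
    p = X.shape[1]
    ridge = np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ Y)
    return np.flatnonzero(np.abs(ridge) > lam)

def lsthrr(X, Y, lam, gamma):
    """lsthrr: least-squares refit on the thrr active set."""
    S = thrr_support(X, Y, lam, gamma)
    beta = np.zeros(X.shape[1])
    if S.size:
        beta[S] = np.linalg.pinv(X[:, S]) @ Y
    return beta
```

Only a single linear solve and one refit are needed per pair (λ, γ), which reflects the computational simplicity of the procedure.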
3.3. The Square-Root Lasso. An interesting approach to mitigate the influence of tuning parameters is the square-root lasso [3] (and, very similarly, the scaled lasso [1,20])

β̂_γ ∈ arg min_{β ∈ ℝ^p} {√n‖Y − Xβ‖₂ + γ‖β‖₁},

where γ > 0 is a tuning parameter. The main feature of the square-root lasso is that its tuning parameter γ does not need to be calibrated to the noise variance. For some special purposes, even a universal tuning parameter of the order √(n log p) can lead to acceptable results. In general, however, universal tuning parameters do not lead to optimal results; instead, the tuning parameter of the square-root lasso needs to be carefully calibrated to the noise distribution and the design matrix. As for the lasso, this calibration is an open question in the literature.
Just as for the lasso, AVp provides optimal two-step prediction for the square-root lasso. More precisely, the two procedures are completely interchangeable in our framework, as the ordered active sets of the square-root lasso coincide with the ordered active sets of the lasso. To see this, note that for any lasso tuning parameter λ, there is a square-root lasso tuning parameter γ such that ‖β̂_γ‖₁ = ‖β̂_λ‖₁ (and vice versa). One can then check that β̂_γ and β̂_λ solve the same ℓ₁-constrained least-squares problem and, therefore, share their active sets. This implies that there is a one-to-one correspondence between the tuning parameter paths of the lasso and the square-root lasso and, therefore, the lasso and the square-root lasso are interchangeable in our framework. In particular, AVp provides optimal calibration for both estimators.
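The correspondence can also be seen from the first-order conditions. The following derivation is a sketch under our normalizations of the two objectives (the lasso penalty 2λ‖β‖₁ in (3.2) and the square-root lasso objective √n‖Y − Xβ‖₂ + γ‖β‖₁), with ρ̂ denoting the residual norm:

```latex
\text{Let } \hat\rho := \|Y - X\hat\beta\|_2 > 0.
\text{ The stationarity conditions of the two problems read}
\[
\sqrt{n}\,\frac{X^\top (Y - X\hat\beta)}{\hat\rho} \in \gamma\,\partial\|\hat\beta\|_1
\quad\text{(square-root lasso)},
\qquad
X^\top (Y - X\hat\beta) \in \lambda\,\partial\|\hat\beta\|_1
\quad\text{(lasso)}.
\]
\text{They coincide for } \lambda = \gamma\hat\rho/\sqrt{n},
\text{ so every square-root lasso solution is a lasso solution for a matched}
\text{ tuning parameter, and the active sets along the two paths agree.}
```

This matching λ = γρ̂/√n depends on the data through ρ̂, which is exactly why the square-root lasso adapts to the noise level while the lasso does not.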
The one-to-one correspondence between the paths of the lasso and the square-root lasso especially implies that AV p applied to these methods yields the same two-step prediction if the entire paths are considered (that is, if r is the total number of active sets along the paths). In practice, however, only a part of the paths are considered (that is, r is smaller than the total number of active set along the paths). Then, AV p does not necessarily yield the same results for the lasso and the square-root lasso but instead is optimal for each of the considered families of estimators individually.
Most interesting for our purposes is that we can use an approach based on the square-root lasso to obtain a fast, rough estimate of the noise variance. This is sufficient for AV p , which only weakly depends on an accurate estimate of this quantity: we refer to Section 4 for details.

4. Experiments. We measure the numerical performance and the computational speed of two-step prediction with AVp and with 10-fold cross-validation, which is typically regarded as the standard calibration scheme. The lasso and thrr are chosen as variable selection methods. These variable selection methods are not directly compared; rather, different calibration schemes are compared for two-step prediction for each of these methods.
The data are generated according to a linear regression model as in (1.1) with n = 200 and a varying number of parameters p ∈ {100, 200, 500, 1000}. The first 10 entries of the regression vector β are set to 1, while all other entries are set to 0. The components of the noise vector are independently sampled from a univariate standard normal distribution with mean 0 and variance 1. The rows of the design matrix X are independently sampled from a multivariate normal distribution with mean 0 and covariance matrix Σ, which is set to Σ_ij = 1 for i = j and to Σ_ij = ρ for i ≠ j. For ease of presentation, we restrict to ρ = 0.5. Subsequently, the columns of X are normalized to Euclidean norm √n. For all experiments, we perform 50 repetitions.

Algorithm 2: Scaled lasso algorithm with early stopping.
Data: Y, X, δ
Result: σ̂
Initialize λ₀ ← √(2n log p) and σ̂ ← 1
repeat
    Save σ̄ ← σ̂
    Update σ̂: Set λ ← σ̂λ₀; compute β̂_λ as the lasso with tuning parameter λ according to (3.2); set σ̂ ← ‖Y − Xβ̂_λ‖₂/√n
until |σ̂ − σ̄| ≤ δ
In addition to the described parameter settings, we tested various other settings, including different correlation coefficients ρ, regression vectors β, and tuning parameter grids. As the conclusions were virtually the same across all settings, we restrict our presentation to the ones described. All computations are conducted with the standard implementations of the lasso and ridge regression in Python scikit-learn (version 0.15) [16].
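The simulation setup above can be reproduced with a short numpy sketch (the function name is ours; we use plain numpy in place of the paper's scikit-learn-based scripts):

```python
import numpy as np

def make_data(n=200, p=500, rho=0.5, s=10, seed=0):
    """Generate data as in Section 4: equicorrelated Gaussian design,
    regression vector with s leading ones, standard normal noise;
    the columns of X are then normalized to Euclidean norm sqrt(n)."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((p, p), rho)
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
    beta = np.zeros(p)
    beta[:s] = 1.0
    Y = X @ beta + rng.normal(size=n)
    return X, Y, beta
```

Varying `p`, `rho`, or `beta` reproduces the additional settings mentioned above.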

4.1. Practical choice of a. If σ² is known, we recommend using Definition 2.2 with a = σ², as suggested by Theorem 3.2 (regarding the term t log r as a superfluous term coming from our proof technique). In practice, however, the noise variance σ² is typically unknown, and we then advocate using a = σ̂² with a rough estimate σ̂². To obtain such an estimate, one can use one of the fast algorithms for the square-root lasso [4] or the scaled lasso [21] with (very) early stopping. For our simulations, we have opted for the latter, which consists of an alternating minimization for estimating both the regression parameter and the noise level. Algorithm 2 states our concrete implementation. We set the tolerance as δ = 10⁻², which typically leads to less than five iterations of the loop in the algorithm and therefore, as illustrated below, to very small computational costs.
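A self-contained sketch of Algorithm 2 follows; we substitute a plain coordinate-descent lasso for the scikit-learn solver used in the paper, and the update σ̂ ← ‖Y − Xβ̂_λ‖₂/√n is the standard scaled-lasso noise update, which we assume here:

```python
import numpy as np

def lasso_cd(X, Y, lam, n_sweeps=200):
    """Coordinate descent for min ||Y - X b||_2^2 + 2*lam*||b||_1
    (a simple stand-in for the scikit-learn lasso solver)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)
    r = Y.copy()                       # running residual Y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]     # partial residual without coordinate j
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta

def scaled_lasso_sigma(X, Y, delta=1e-2, max_iter=20):
    """Algorithm 2: rough noise-level estimate via alternating updates,
    stopped early once sigma changes by less than delta."""
    n, p = X.shape
    lam0 = np.sqrt(2 * n * np.log(p))
    sigma = 1.0
    for _ in range(max_iter):
        sigma_old = sigma
        beta = lasso_cd(X, Y, sigma * lam0)
        sigma = np.linalg.norm(Y - X @ beta) / np.sqrt(n)
        if abs(sigma - sigma_old) <= delta:
            break
    return sigma
```

Since AVp only needs a rough estimate of σ², a handful of alternating iterations suffices, which keeps the overhead negligible compared to computing the lasso path itself.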

4.2. Choice of the tuning parameter grids. In the case of the lasso, the tuning parameter grid is chosen as the default grid of the lasso function in scikit-learn. More precisely, we take a geometric grid of size r = 10 starting from λ_max = ‖X^⊤Y‖_∞, the smallest tuning parameter that leads to the all-zeros lasso solution. We consider the following methods:

lasso path: Computation of one tuning parameter path of the lasso.
lslasso path: Computation of one tuning parameter path of the lasso with least-squares refitting.
lassoAVp: Least-squares refitted lasso with tuning parameter selected by AVp with a = σ².
lassoAVp (with σ̂): Least-squares refitted lasso with tuning parameter selected by AVp and a = σ̂² as detailed above.
lassoCV: Least-squares refitted lasso with tuning parameter selected by 10-fold cross-validation on the estimates of the (one-step) lasso.
lslassoCV: Least-squares refitted lasso with tuning parameter selected by 10-fold cross-validation on the estimates of the least-squares refitted lasso.
We then also consider the corresponding estimators for thresholded ridge regression (thrr). A comparison of AVp with and without the estimation of σ shows that the computational time for the rough estimation of the noise level is negligible. Figure 3 also shows that AVp is about 10 times faster than cross-validation. In some cases, the computation of AVp is even faster than the computation of a single lasso path, because AVp can stop early (that is, it does not have to consider all supports Ŝ_{i,j}, cf. Definition 2.2 and Algorithm 1). The statistical performances in terms of prediction error for lassoAVp (with σ̂), lassoCV, and lslassoCV are reported in Figure 4, and for thrrAVp and lsthrrCV in Figure 5.
In conclusion, our simulations demonstrate that AV p is highly competitive both in prediction performance and in computational speed.

5. Conclusions. We have introduced AVp, a novel calibration scheme for two-step prediction in linear regression. Our simulations demonstrate that AVp has excellent performance for prediction with the lasso and thresholded ridge regression. In contrast to standard calibration schemes, such as cross-validation and BIC/AIC-type criteria, AVp also allows for optimal theoretical guarantees. In addition, AVp can drastically decrease the computational costs as compared to standard schemes. In our context, AVp is, therefore, the only calibration scheme that simultaneously provides theoretical guarantees, practical performance, and computational speed.

Our work could be extended as follows. Although the assumptions in Theorem 3.1 appear to hold in practice, it would be of interest to find explicit conditions on the design matrix X and the regression vector β that would guarantee Assumption 3.2 for the lasso or thresholded ridge regression. In addition, it would be of interest to relax the assumption of polyhedral regions in Theorem 3.1 to extend our results to variable selection procedures with non-linear boundaries. Finally, we expect that the Gaussian noise assumption running through Section 3 (but which is not needed for Section 2) could be weakened, for example, to account for sub-Gaussian noise.

APPENDIX A: PROOFS
The following subsections discuss two first-step estimators, the lasso and thresholded ridge regression, and the proof of Theorem 3.1. The proofs rely on various convex geometry notions, so we want to remind the reader of some background on the subject.
The affine hull of a set A, denoted aff A, is the intersection of all affine spaces that contain A or, alternatively, the unique affine set of minimal dimension that contains A. We write, by extension, dim A = dim aff A. The relative interior and boundary of A, denoted relint A and relbd A, are respectively the interior and the boundary when A is seen as a subset of its affine hull.
A half-space H⁺ is a set of the form {x ∈ ℝⁿ : a^⊤x ≤ b} for a ∈ ℝⁿ, b ∈ ℝ. Its boundary H = ∂H⁺ = {x ∈ ℝⁿ : a^⊤x = b} is a hyperplane in ℝⁿ. A polyhedron is a finite intersection of half-spaces, X = ∩_{i∈I} H_i⁺. A polyhedron can usually be represented by many such decompositions; we call a decomposition irreducible if ∩_{j≠i} H_j⁺ ≠ X for all i ∈ I. A facet of X is a set F_i = X ∩ H_i, given an irreducible decomposition; the faces are the intersections of (potentially many) facets. The normal cone to a point x₀ ∈ X is N(X, x₀) := {v ∈ ℝⁿ : v^⊤(x − x₀) ≤ 0 for all x ∈ X}. One can show [17, p. 83] that for a given face F, all x₀ ∈ relint F have the same normal cone; hence we define the normal cone to F to be N(X, F) = N(X, x₀) for any x₀ ∈ relint F. It will be useful to introduce the following concepts. The "regions" are the closures of the sets of equal sign vector; following the notation of Section 3, we write V = {cl W₁(η) : η ∈ {−1, 0, 1}^p}. To simplify notation, we will write the target as ξ = Xβ. We will also write ξ_n to remind the reader that this depends on n, when relevant.
Proof of Theorem 3.2. We split the proof into two parts. We first show that, on an appropriate scale, Y must be close to the target with high probability. This is then shown to imply, using the scale-symmetry property, that the ordered active sets must be unique with high probability. Second, this is used to bound ‖P_Ŝ ε‖₂² using χ² quantiles.
Proof of Proposition 3.1. We prove each condition in order. Part i) The scale-symmetry follows from consideration of the dual problem to (3.2). Let β̂_λ be a lasso solution whose active set equals the equicorrelation set E[λ; Y]. Let C_λ stand for the polyhedron {x ∈ ℝⁿ : ‖X^⊤x‖_∞ ≤ λ}, and let P_{C_λ} denote the Euclidean projection on this set. Notice that for any x ∈ C_λ, we have x/λ ∈ C₁ and therefore

‖Y − x‖₂ = λ‖Y/λ − x/λ‖₂ ≥ λ‖Y/λ − P_{C₁}(Y/λ)‖₂.

Since this is true for all x ∈ C_λ, and since λP_{C₁}(Y/λ) ∈ C_λ, we conclude that λP_{C₁}(Y/λ) = P_{C_λ}(Y). Now, as shown in [24], the residual from the lasso satisfies

Y − Xβ̂_λ = P_{C_λ}(Y).

Therefore, for any λ > 0 the active set of β̂_λ satisfies

E[λ; Y] = {j : |X_j^⊤P_{C_λ}(Y)| = λ} = {j : |X_j^⊤P_{C₁}(Y/λ)| = 1} = E[1; Y/λ],

as desired.
Proof of Proposition 3.2. We prove each condition in order. Part i) By definition of the estimator, the active set is

Ŝ_{λ,γ} = {j ∈ [p] : |((X^⊤X + γI)⁻¹X^⊤Y)_j| > λ},

so Ŝ_{λ,γ}(Y) = Ŝ_{1,γ}(Y/λ), which establishes scale-symmetry. Part ii) Each region of equal sign vector η is cut out by the inequalities η_j((X^⊤X + γI)⁻¹X^⊤Y)_j > λ for η_j ≠ 0 and |((X^⊤X + γI)⁻¹X^⊤Y)_j| ≤ λ for η_j = 0, so the closure V of each region is a finite intersection of closed half-spaces, i.e., a polyhedron. Moreover, sgn β̂_{λ,γ} is constant over int V, as desired.
A.1. Proof of Theorem 3.1. We begin with some preliminary observations before delving into the proof of the theorem.
Recall the definition of $\mathcal{V}$ from the start of Section 3. If Assumption 2.1 holds, then the regions must have disjoint interiors. Take $V \ne V' \in \mathcal{V}$: being polyhedra, their boundaries $\partial V$, $\partial V'$ can be partitioned by the relative interiors of their proper faces [17, Thm. 2.1.2]. Let $\mathcal{F}(V)$ denote the set of proper faces of a polyhedron $V$, and let
$$\mathcal{C} = \{\operatorname{relint} F \cap \operatorname{relint} F' \mid F \in \mathcal{F}(V),\ F' \in \mathcal{F}(V'),\ V \ne V' \in \mathcal{V}\}$$
be the collection of "boundary pieces". We enumerate, for reference, two properties of $\mathcal{C}$: i) for two distinct $V, V' \in \mathcal{V}$ and an $x \in \partial V \cap \partial V'$, there is a unique $B \in \mathcal{C}$ such that $x \in B$; ii) each $B \in \mathcal{C}$ has dimension at most $n - 1$.
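As an illustration of the boundary pieces (our own example, with $n = 2$), consider two unit squares sharing an edge:

```latex
% V = [0,1]^2 and V' = [1,2] \times [0,1] have disjoint interiors and share
% \partial V \cap \partial V' = \{1\} \times [0,1].
% The pieces relint F \cap relint F' with F \in \mathcal{F}(V),
% F' \in \mathcal{F}(V') that are nonempty are
\{1\} \times (0,1), \qquad \{(1,0)\}, \qquad \{(1,1)\}.
% Property i): each x \in \partial V \cap \partial V' lies in exactly one piece.
% Property ii): the pieces have dimensions 1, 0, 0, all at most n - 1 = 1.
```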
Indeed, for the first statement we notice that, by the partitioning, there exist unique faces $F \in \mathcal{F}(V)$ and $F' \in \mathcal{F}(V')$ such that $x \in \operatorname{relint} F \cap \operatorname{relint} F'$, hence a unique $B \in \mathcal{C}$ such that $x \in B$. For the second, since the interiors of $V, V'$ are disjoint, for $B = \operatorname{relint} F \cap \operatorname{relint} F' \ne \emptyset$ we must have $F \subset \partial V$, $F' \subset \partial V'$, hence $\dim B \le \dim \operatorname{relint} F \wedge \dim \operatorname{relint} F' \le n - 1$. In addition to these observations, we will need the following two lemmas and one supporting proposition.
Lemma A.1. If $B \in \mathcal{C}$ is of dimension at most $n - 2$, then $\mathbb{R}_+ B$ is of dimension at most $n - 1$.

Proof. The affine hull $\operatorname{aff} B$ has dimension at most $n - 2$, so the linear span of $\operatorname{aff} B \cup \{0\}$ has dimension at most $n - 1$. Since $\mathbb{R}_+ B$ is contained in this span, it has dimension at most $n - 1$.

Lemma A.2. Let $B \in \mathcal{C}$ be of dimension $n - 1$ and $L$ be a ray centered at the origin such that $B \cap L \ne \emptyset$. Then either $\mathbb{R}_+ B$ has dimension $n - 1$, or $L \cap \operatorname{int} V \ne \emptyset$ and $L \cap \operatorname{int} V' \ne \emptyset$.
Proof. Write $B = \operatorname{relint} F \cap \operatorname{relint} F'$. Since $\operatorname{relint} F$ and $\operatorname{relint} F'$ both have dimension at most $n - 1$, and $B$ has dimension $n - 1$, they must have dimension exactly $n - 1$. They are thus facets of their respective polyhedra $V, V' \in \mathcal{V}$, which must have dimension $n$, and each has exactly one supporting hyperplane. Thus $\operatorname{aff} B = \operatorname{aff} F = \operatorname{aff} F'$, which we denote by $S$. Let $b \in L \cap S$, write $S = \langle a_1, \dots, a_{n-1} \rangle + b$ for some linearly independent vectors $a_1, \dots, a_{n-1}$, and let $a_n$ be such that $L = \mathbb{R}_+ a_n$. Consider the affine transformation $\varphi(x) = [a_1, \dots, a_{n-1}, a_n]\,x + b = Ax + b$, which maps vectors $(x_1, \dots, x_{n-1}, 0)$ bijectively to $S$, and vectors $(0, \dots, 0, x_n)$ with $x_n \ge -\|b\|/\|a_n\|$ bijectively to $L$. (Recall that $b \in S \cap L$.) There are then two possibilities.
Case i) $A$ has rank $n - 1$. Then $a_n \in \langle a_1, \dots, a_{n-1} \rangle$ and $L \subset S$. But this means that $0 \in L \subset S$, i.e., that $S$ is a subspace. Then $\mathbb{R}_+ B \subset \mathbb{R} S = S$, i.e., $\mathbb{R}_+ B$ has dimension $n - 1$.

Case ii) $A$ has rank $n$. Then $L \not\subset S$, so $L$ meets $S$ only at $b$; since $B \cap L \ne \emptyset$ and $B \subset S$, we must have $b \in B$. The polyhedra $V$ and $V'$ are $n$-dimensional with supporting hyperplane $S$ along their facets $F$ and $F'$; since their interiors are disjoint, they lie on opposite sides of $S$. As $L$ crosses $S$ transversally at $b \in \operatorname{relint} F \cap \operatorname{relint} F'$, points of $L$ sufficiently close to $b$ on either side of $S$ belong to $\operatorname{int} V$ and $\operatorname{int} V'$ respectively, so $L \cap \operatorname{int} V \ne \emptyset$ and $L \cap \operatorname{int} V' \ne \emptyset$.
We may now turn to the proof of the theorem.
Proof of Theorem 3.1. Every region V ∈ V is a polyhedron, so has a decomposition V = P (V ) + C(V ) into a polytope P (V ) and a cone C(V ) by Minkowski's theorem [28,Thm. 1.2].
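For instance (an illustrative decomposition of our own choosing):

```latex
% The polyhedron V = \{x \in \mathbb{R}^2 \mid x_1 \ge 0,\ x_2 \ge 0,\ x_1 + x_2 \ge 1\}
% decomposes as V = P(V) + C(V), with polytope and recession cone
P(V) = \operatorname{conv}\{(1,0),\,(0,1)\}, \qquad C(V) = \mathbb{R}_+^2.
% Every point of V is a point of the segment P(V) plus a vector with
% nonnegative coordinates, in accordance with Minkowski's theorem.
```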
The set $R$ is a finite union of sets of dimension at most $n - 1$, so it is closed and has measure zero. We argue that if $\xi \in R^C$, then $\mathbb{R}_+\xi \subset \operatorname{int} T$. Indeed, say that $t\xi \in \partial T$ for some $t > 0$. Since $\partial T \subset \bigcup_{V \in \mathcal{V}_0} \partial V$, there is a $V_0 \in \mathcal{V}_0$ such that $t\xi \in \partial V_0$. As $t\xi \in \partial T$, we also have $t\xi \in \operatorname{cl}(T^C) = \bigcup_{V \in \mathcal{V}_0} V$, since $\mathbb{R}^n = \bigcup_{V \in \mathcal{V}} V$. Thus there must also be a $V_1 \ne V_0$, $V_1 \in \mathcal{V}_0$, such that $t\xi \in V_1$. But since the interiors are disjoint, $t\xi \in \operatorname{int} V_1$ would contradict $t\xi \in \partial V_0$; hence $t\xi \in \partial V_1$. Thus $t\xi \in \partial V_0 \cap \partial V_1$, and there must be a unique $B \in \mathcal{C}$ such that $t\xi \in B$. That piece, like all elements of $\mathcal{C}$, must be of dimension $n - 1$ or lower. We argue that $B \subset R$. If it has dimension at most $n - 2$, then $\dim \mathbb{R}_+ B \le n - 1$ by Lemma A.1, so $B$ is indeed a subset of $R$. Now say instead that it has dimension $n - 1$, and recall that $t\xi \in \mathbb{R}_+\xi \cap B$. Let $F \in \mathcal{F}(V_0)$ and $F' \in \mathcal{F}(V_1)$ be such that $B = \operatorname{relint} F \cap \operatorname{relint} F'$. By Lemma A.2, we must have either $\dim \mathbb{R}_+ B = n - 1$, or $\mathbb{R}_+\xi \cap \operatorname{int} V_0 \ne \emptyset$ and $\mathbb{R}_+\xi \cap \operatorname{int} V_1 \ne \emptyset$. But if the latter were the case, then $V_1 \subset T$ by definition, which is impossible; thus we must have $\dim \mathbb{R}_+ B = n - 1$, hence $B \subset R$ again.
Thus in all cases, $B \subset R$. Since $t\xi \in B$, this means $d(t\xi, R) = 0$. But at the same time, since $\xi \in R^C$ and $R$ is closed, we must have $d(\xi, R) > 0$, and since $R$ is invariant under multiplication by positive scalars,
$$d(t\xi, R) = \inf_{y \in R} \|t\xi - y\| = t \inf_{y \in R} \|\xi - y/t\| = t \inf_{z \in R} \|\xi - z\| = t\,d(\xi, R) > 0.$$
This is a contradiction, and we conclude that $\mathbb{R}_+\xi \subset \operatorname{int} T$. Now, $\mathcal{V}_0$ is finite, since $C$ has only a finite number of faces. Let $h$ be the continuous map $t \mapsto t\xi$, and consider for each $V \in \mathcal{V}_0$ the closed set $h^{-1}(V)$. This set must be convex, since for $s, t \in h^{-1}(V)$ and $u \in [s, t]$, the point $u\xi$ is a convex combination of $s\xi, t\xi \in V$ and hence lies in $V$; being a closed convex subset of $[0, \infty)$, it is a closed interval. Enumerate arbitrarily the $V \in \mathcal{V}_0$ as $V_1, \dots, V_m$, and for each $i$ write $h^{-1}(V_i) = [a_i, b_i]$, with $b_i = \infty$ allowed. Then some $b_i$ must equal $\infty$, since otherwise the union would be bounded. Moreover, since the interiors of the $V_i$ are disjoint, $(a_i, b_i) \cap (a_j, b_j) = \emptyset$ for $i \ne j$, so the index with $b_i = \infty$ is unique and all other $b_i$ are finite. By reordering if necessary, take $0 = a_1 < b_1 \le a_2 < b_2 \le \dots \le a_m < b_m = \infty$.