Uniformly Valid Confidence Sets Based on the Lasso

In a linear regression model of fixed dimension p ≤ n, we construct confidence regions for the unknown parameter vector based on the Lasso estimator that uniformly and exactly hold the prescribed coverage level in finite samples as well as in an asymptotic setup. We thereby quantify estimation uncertainty as well as the "post-model selection error" of this estimator. More concretely, in finite samples with Gaussian errors and asymptotically in the case where the Lasso estimator is tuned to perform conservative model selection, we derive exact formulas for computing the minimal coverage probability over the entire parameter space for a large class of shapes for the confidence sets, thus enabling the construction of valid confidence regions based on the Lasso estimator in these settings. The choice of shape for the confidence sets and a comparison with the confidence ellipse based on the least-squares estimator are also discussed. Moreover, in the case where the Lasso estimator is tuned to enable consistent model selection, we give a simple confidence region with minimal coverage probability converging to one. Finally, we also treat the case of unknown error variance and present some ideas for extensions.


Introduction
The Lasso estimator as introduced in Tibshirani (1996), as well as many variants thereof, have gained strong interest in the statistics community and in applied areas over the past two decades. As is well known, the main attraction of the Lasso estimator lies in its ability to perform model selection and parameter estimation at very low computational cost, see for instance Alliney & Ruzinsky (1994), Efron et al. (2004) and Rosset & Zhu (2007), and in the fact that the estimator can be used in high-dimensional settings where the number of variables p exceeds the number of observations n ("p > n").
The literature on distributional properties of the Lasso estimator in the low-dimensional setting (p ≤ n) includes the often-cited paper by Knight & Fu (2000), who derive the asymptotic distribution when the estimator is tuned to perform conservative model selection. Pötscher & Leeb (2009) give a detailed analysis in the framework of a linear regression model with orthogonal design and derive the distribution of the Lasso estimator in finite samples as well as in the two asymptotic regimes of consistent and conservative tuning. Implications of these results for confidence intervals are analyzed in Pötscher & Schneider (2010), and generalizations to a moderate-dimensional setting where p ≤ n but p diverges with n are contained in Pötscher & Schneider (2011) and Schneider (2015).
In a high-dimensional setting with p > n, confidence regions and confidence intervals in connection with the Lasso estimator have recently been treated in a number of papers, including Zhang & Zhang (2014), Van de Geer et al. (2014), Javanmard & Montanari (2014), Caner & Kock (2014) and Van de Geer (2015). All these papers use the idea of "de-sparsifying" the Lasso estimator, which in the case of p ≤ n essentially reduces to using the least-squares (LS) estimator for inference. In that sense, this theory leaves a gap as to how to construct confidence regions based on the Lasso estimator in a low-dimensional framework in order to provide uncertainty quantification for the Lasso estimator also in this case. Lee et al. (2013) consider finite-sample results for confidence intervals in connection with the Lasso estimator, yet these authors take a different route in that their intervals are not set to cover the true parameter, but a pseudo-true value that depends on the selected model and coincides with the true parameter if the selected model is correct. All inference is conditional on the selected model. Their method is in line with the general proposal of Berk et al. (2013), who discuss an intricate procedure for obtaining confidence regions for this pseudo-true parameter after a model selection step.
In this paper, we construct confidence sets based on the Lasso estimator for the entire unknown parameter vector. One of the challenges of this task lies in the well-known fact that the finite-sample distribution of the Lasso estimator depends on the unknown parameter in a complicated manner. This phenomenon does not vanish for large samples, as can be seen within a so-called moving-parameter asymptotic framework (see Pötscher & Leeb (2009) for a detailed analysis in orthogonal design), and also occurs for related estimators. In order to construct valid confidence sets, we need to know the smallest coverage probability occurring over the whole parameter space. Pötscher & Schneider (2010) derive a formula for the minimal coverage probability of fixed-width confidence intervals based on the Lasso estimator in one dimension using knowledge of its finite-sample distribution. In the general case, this finite-sample distribution is not known, so it is not clear how to obtain an expression for the coverage probability in more than one dimension. Additionally, this coverage probability clearly depends on the shape that is used for the confidence set, and it is not clear a priori what this shape should be. We do the following.
While the finite-sample distribution, and therefore the coverage probability of any kind of set based on the Lasso estimator, is unknown in general dimensions, we show that computing the minimal coverage probability can actually be carried out without this explicit knowledge. We obtain an explicit formula for the minimal coverage probability by, in a way, deferring the minimization problem into the objective function that defines the estimator, as is depicted in Section 3. For the confidence regions, we consider a large class of shapes that is determined by a condition involving the regressor matrix. This class encompasses the elliptic shape one would use if the confidence region were based on the LS estimator, thus enabling comparisons with the LS confidence ellipse. Analogously to the fixed-width intervals in Pötscher & Schneider (2010), the confidence regions we consider are random only through their centering at the Lasso estimator (which is also in line with the setup in the literature for high-dimensional settings, see for instance Van de Geer et al. (2014)). Asymptotically, we distinguish between two regimes for the tuning parameters, which we call conservative and consistent tuning. As suggested by the results in Pötscher & Schneider (2010), our results from finite samples essentially carry over asymptotically when the estimator is tuned conservatively. In the case of consistent tuning, the uniform convergence rate of the estimator is slower than n^{-1/2}; we give the asymptotic distribution of the Lasso estimator when scaled by the appropriate factor corresponding to the uniform convergence rate and suggest a simple construction for a confidence set in that case.
The remaining paper is organized as follows. In Section 2 we set the framework by stating the model, defining the estimator and introducing some notation. The main result giving the formula for the minimal coverage probability is presented in Section 3, and subsequently Section 4 is devoted to discussing how to concretely construct the corresponding confidence sets, as well as their relationship to the confidence ellipse based on the LS estimator. In Section 5 we derive asymptotic results both for the case of conservative and the case of consistent model selection. Section 6 concludes. All proofs are deferred to Section 7.

Setting and Assumptions
Consider the linear model
y = Xβ + ε,
where y is the observed n × 1 data vector, X is the n × p regressor matrix which is assumed to be nonstochastic with full column rank p, β ∈ R^p is the true parameter vector and ε is the unobserved error term, defined on some probability space (Ω, A, P) and consisting of independent and identically distributed components with mean 0 and finite variance σ². We consider a componentwise tuned Lasso estimator β̂_L, defined as the unique solution to the minimization problem
min_{b∈R^p} L_n(b) with L_n(b) = (y − Xb)'(y − Xb) + 2 Σ_{j=1}^p λ_{n,j}|b_j|,
where the λ_{n,j} are non-negative and non-random componentwise tuning parameters that allow individual parameters to be excluded from penalization. Note that if λ_{n,j} = 0 for all j, this estimator is equal to the ordinary least-squares (LS) estimator β̂_LS, and that λ_{n,j} = c > 0 for all j corresponds to the "classical" Lasso estimator as proposed by Tibshirani (1996). For later use, let λ_n = (λ_{n,1}, . . ., λ_{n,p})' and Λ_n = diag(λ_n), the diagonal matrix whose diagonal elements are given by the components of λ_n. We use 1{·} for the indicator function and make the following obvious definitions. For a ∈ R^p and B ⊆ R^p, the set a + B is defined as the set {a + b : b ∈ B}. For a p × p matrix C and a scalar c, the sets CB and cB in R^p are {Cb : b ∈ B} ⊆ R^p and {cb : b ∈ B} ⊆ R^p, respectively. Finally, for k ∈ N, I_k stands for the k × k identity matrix and R̄ denotes the extended real line R ∪ {−∞, ∞}.
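To make the componentwise-tuned objective above concrete, the following R sketch computes β̂_L by cyclic coordinate descent with componentwise soft-thresholding. It is only a minimal illustration under the objective as reconstructed above (the computations reported in Section 4 use the glmnet package instead); the function names, the tolerance and the toy data are illustrative choices and not part of the paper.

# Minimal sketch: componentwise-tuned Lasso via cyclic coordinate descent,
# minimizing (y - X b)'(y - X b) + 2 * sum_j lambda[j] * |b_j|.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cw <- function(X, y, lambda, n_iter = 500, tol = 1e-10) {
  p <- ncol(X)
  b <- rep(0, p)
  xtx <- colSums(X^2)                              # x_j'x_j for each column
  for (it in seq_len(n_iter)) {
    b_old <- b
    for (j in seq_len(p)) {
      r_j <- y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual
      b[j] <- soft_threshold(sum(X[, j] * r_j), lambda[j]) / xtx[j]
    }
    if (max(abs(b - b_old)) < tol) break
  }
  b
}

# toy example with lambda_{n,j} = sqrt(n)/2 for all j, as in the example of Section 4
set.seed(1)
n <- 20; p <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, 0) + rnorm(n)
beta_lasso <- lasso_cw(X, y, lambda = rep(sqrt(n) / 2, p))
beta_ls <- solve(crossprod(X), crossprod(X, y))    # LS estimator for comparison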

Finite Sample Results
We aim to construct confidence sets for the entire parameter vector β based on the Lasso estimator β̂_L. That means that for a non-random set M ⊆ R^p, we consider sets of the form β̂_L − M = {β̂_L − m : m ∈ M}, which have to satisfy that the probability of actually covering the unknown parameter β never (that is, for no value of β) falls below a prescribed level 1 − α with α ∈ [0, 1]. In other words, we need P_β(β ∈ β̂_L − M) ≥ 1 − α for all β ∈ R^p (where we stress the dependence of the probability measure on β whenever it occurs), so that
inf_{β∈R^p} P_β(β ∈ β̂_L − M) ≥ 1 − α.
In order to achieve this, we need to be able to compute this "infimal" (minimal) coverage probability. Throughout this and the subsequent section we suppose that the errors are normally distributed, an assumption that will be removed for the asymptotic results in Section 5. We will show that the minimum occurs when the components of the unknown parameter become large in absolute value by essentially doing the following. We reparametrize the objective function defining the Lasso estimator so that the dependence on the unknown parameter becomes more transparent and easier to handle. We then consider the limiting cases of the objective function when the components of the unknown parameter vector β become large in absolute value (that is, tend to +∞ or −∞). We will see that it is possible to minimize the resulting objective functions explicitly, with minimizers that follow a shifted normal distribution that has the same variance-covariance matrix as the LS estimator and by construction do not depend on the unknown parameter. Finally, we will show that the infimal coverage probability of the proposed sets is indeed "achieved" for one of these finitely many limiting cases.
To state the main theorem, we need several definitions. First we define the reparametrized objective function Q_n(u) = L_n(β + n^{-1/2}u) − L_n(β), so that Q_n is uniquely minimized at û_n = n^{1/2}(β̂_L − β), the estimation error scaled by n^{1/2}. Of course, this scaling factor is arbitrary in finite samples, but proves to be of advantage when considering the problem in large samples in Section 5.1. We can write Q_n as
Q_n(u) = u'C_n u − 2u'W_n + 2 Σ_{j=1}^p λ_{n,j}(|β_j + n^{-1/2}u_j| − |β_j|),
where C_n = X'X/n and W_n = n^{-1/2}X'ε. Note that for a set M ⊆ R^p we then have
P_β(β ∈ β̂_L − n^{-1/2}M) = P_β(û_n ∈ M).
The above-mentioned limiting cases of the objective function that we consider are defined as
Q^d_n(u) = u'C_n u − 2u'W_n + 2n^{-1/2} Σ_{j=1}^p λ_{n,j} d_j u_j,    (1)
where d = (d_1, . . ., d_p)' ∈ {−1, 1}^p. Holding W_n fixed for a moment, we indeed see that Q_n(u) converges to Q^d_n(u) for every u ∈ R^p when d_jβ_j → ∞ for all j = 1, . . ., p. As shorthand notation, we write û^d_n for the unique minimizer of Q^d_n. To define the shape that we want to consider for the confidence regions, we introduce the following notation. For m ∈ R^p, a vector d ∈ {−1, 1}^p and a p × p matrix C, we define the set
A^d_C(m) = {z ∈ R^p : d_j z_j ≤ 0 and d_j(Cm)_j ≤ d_j(Cz)_j for all j = 1, . . ., p}.
The set A^d_C(m) is an intersection of 2p half-spaces, p of which determine the orthant the set is located in via the parameter d. The other p half-spaces are defined by hyperplanes that intersect at the point m. Figure 1 shows one example of such a set. Note that, in general, A^d_C(m) can be non-empty also when m itself does not lie in the orthant determined by d. The sets we consider are determined by the following condition.
Condition A. Let C ∈ R^{p×p} be given. We say that a set M ⊆ R^p satisfies Condition A with matrix C if A^d_C(m) ⊆ M for all m ∈ M and all d ∈ {−1, 1}^p.

The above condition will be discussed in more detail in Section 4. Using this notation, we can now state the main theorem.

Theorem 1. Let M_n ⊆ R^p satisfy Condition A with matrix C_n. Then
inf_{β∈R^p} P_β(β ∈ β̂_L − n^{-1/2}M_n) = min_{d∈{−1,1}^p} P(û^d_n ∈ M_n),
where û^d_n ∼ N(−n^{-1/2}C_n^{-1}Λ_n d, σ²C_n^{-1}).
The distributions of û^d_n determining the formula for the infimal coverage probability are shifted normal distributions with the same variance-covariance matrix as the corresponding (shifted and scaled) LS estimator û_LS = n^{1/2}(β̂_LS − β) and with a mean that depends on the regressors and the vector of tuning parameters. Since Condition A for p = 1 simply requires the corresponding set M_n to be an interval containing zero, Theorem 1 is indeed a generalization of the formula in Theorem 5(a) in Pötscher & Schneider (2010), as discussed in the introduction. (To make the connection, note that the tuning parameter η_n in that reference corresponds to a component n^{-1/2}λ_{n,j} of the vector of tuning parameters in our paper.) The following obvious corollary specifies the resulting valid confidence region based on the Lasso estimator.

Corollary 2. Let M_n ⊆ R^p satisfy Condition A with matrix C_n and suppose that min_{d∈{−1,1}^p} P(û^d_n ∈ M_n) ≥ 1 − α. Then P_β(β ∈ β̂_L − n^{-1/2}M_n) ≥ 1 − α for every β ∈ R^p.
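Since each û^d_n is normally distributed with mean −n^{-1/2}C_n^{-1}Λ_n d and variance σ²C_n^{-1}, the right-hand side of Theorem 1 can be evaluated without knowing the distribution of û_n itself. The following R sketch approximates min_d P(û^d_n ∈ M_n) by Monte Carlo for an arbitrary membership test in_M; it is an illustration of the formula as reconstructed here, not code from the paper, and all function names are hypothetical.

# Monte Carlo evaluation of min over d in {-1,1}^p of P(uhat^d_n in M_n),
# where uhat^d_n ~ N(-n^{-1/2} C_n^{-1} Lambda_n d, sigma^2 C_n^{-1}).
min_coverage <- function(Cn, lambda_n, n, sigma, in_M, reps = 1e5) {
  p <- ncol(Cn)
  Cn_inv <- solve(Cn)
  R <- chol(sigma^2 * Cn_inv)                       # R'R = sigma^2 * Cn^{-1}
  Z <- matrix(rnorm(reps * p), reps, p) %*% R       # centered normal draws (one per row)
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), p)))
  cov_d <- apply(signs, 1, function(d) {
    mu_d <- as.vector(-Cn_inv %*% (lambda_n * d) / sqrt(n))  # mean of uhat^d_n
    U <- sweep(Z, 2, mu_d, "+")                     # shift the draws by mu_d
    mean(apply(U, 1, in_M))                         # approximates P(uhat^d_n in M_n)
  })
  min(cov_d)
}

# Example membership test for the C_n-ellipse of Section 4, E_{C_n}(k):
#   in_M <- function(u) as.numeric(t(u) %*% Cn %*% u) <= k
# A candidate M_n is admissible for Corollary 2 once min_coverage(...) >= 1 - alpha.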

Constructing the Confidence Set
We now turn to the important matter of how to choose an appropriate set M_n ⊆ R^p for a desired level of confidence 1 − α, discussing concrete shapes for the confidence regions as well as their size and their relation to confidence sets based on the LS estimator. As mentioned in the previous section, we need to find a set M_n ⊆ R^p that satisfies Condition A with C = C_n and such that min_{d∈{−1,1}^p} P(û^d_n ∈ M_n) ≥ 1 − α. The resulting confidence set for β is then the scaled and shifted set β̂_L − M_n/n^{1/2}. If we based the set on the LS estimator β̂_LS instead of β̂_L, the canonical and best choice for M_n in terms of volume would be an ellipse determined by the contour lines of the N(0, σ²C_n^{-1}) distribution. Given the fact that the variance-covariance matrix of the distributions of û^d_n is in fact σ²C_n^{-1}, in addition to the fact that the means of the distributions average to 0, it is reasonable to consider the C_n-ellipse as a shape in connection with the Lasso estimator also. As stated in the following proposition, this shape complies with Condition A.
Proposition 3. For any k > 0, the C_n-ellipse given by E_{C_n}(k) = {x ∈ R^p : x'C_n x ≤ k} satisfies Condition A with C = C_n.

How to choose the parameter k for a given level of coverage 1 − α is stated in the next proposition.
Proposition 4. For any k > 0, we have min_{d∈{−1,1}^p} P(û^d_n ∈ E_{C_n}(k)) = P(û^{d*}_n ∈ E_{C_n}(k)), where d* is any maximizer of d'Λ_n C_n^{-1}Λ_n d over d ∈ {−1, 1}^p.

Note that if d* solves the above optimization problem, so does −d*. To finally obtain the confidence ellipse based on the Lasso estimator, pick any such optimizer d* and compute k* > 0 so that P(û^{d*}_n ∈ E_{C_n}(k*)) = 1 − α, which is easily done numerically. Note that Proposition 4 also shows that the ellipse E_{C_n}(k*), and therefore the resulting confidence set based on the Lasso estimator, is larger in volume than the one based on the LS estimator, since E_{C_n}(k*) needs to be large enough so as to have mass 1 − α with respect to the distribution of û^{d*}_n, whereas for the ellipse corresponding to the LS estimator it suffices to have mass 1 − α with respect to the N(0, σ²C_n^{-1})-measure. Clearly, the difference in size will increase as the tuning parameters become large. These observations are in line with the findings in Pötscher & Schneider (2010), who show that a confidence interval based on the Lasso estimator is larger than a confidence interval based on the LS estimator with the same coverage probability. When comparing the two confidence sets, we emphasize that since the ellipses are centered at different values, the smaller ellipse based on the LS estimator is in general not contained in the ellipse based on the Lasso estimator. This, as well as the difference in volume between the two ellipses, will also be illustrated in the example below.
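For the ellipse, no simulation is needed: under the shifted-normal form of û^d_n used here, (û^d_n)'C_n û^d_n/σ² follows a noncentral chi-square distribution with p degrees of freedom and noncentrality d'Λ_n C_n^{-1}Λ_n d/(nσ²), so d* and k* can be computed exactly. The short R sketch below is an illustration of this computation (not the paper's code) and also returns the LS parameter σ²χ²_{p,1−α} for comparison.

# Exact calibration of the C_n-ellipse E_{C_n}(k) = {x : x'C_n x <= k}:
# P(uhat^d_n in E_{C_n}(k)) = P(chisq_p(ncp_d) <= k / sigma^2) with
# ncp_d = d' Lambda_n C_n^{-1} Lambda_n d / (n * sigma^2).
ellipse_k <- function(Cn, lambda_n, n, sigma, alpha) {
  p <- ncol(Cn)
  Ln <- diag(lambda_n, p)
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), p)))
  ncp <- apply(signs, 1, function(d)
    as.numeric(t(d) %*% Ln %*% solve(Cn, Ln %*% d)) / (n * sigma^2))
  ncp_star <- max(ncp)                              # attained by a worst-case d*
  k_star <- sigma^2 * qchisq(1 - alpha, df = p, ncp = ncp_star)  # Lasso ellipse
  k_ls   <- sigma^2 * qchisq(1 - alpha, df = p)                  # LS ellipse
  list(k_star = k_star, k_ls = k_ls)
}

Since the noncentral chi-square quantile increases with the noncentrality parameter, k_star is at least k_ls, in line with the volume comparison discussed above.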
It is quite obvious that the C_n-ellipse is not optimal as a shape for confidence sets based on the Lasso estimator, since we can get higher coverage with a set of the same volume by adjusting the ellipse "towards" the contour lines of the distributions of û^d_n (in such a way that Condition A is preserved). To find the best shape possible, one would have to minimize the volume of the set over all possible shapes satisfying Condition A, subject to the constraint of holding the prescribed minimal coverage probability. This is a highly complex optimization problem and we do not dwell further on this subject here, but illustrate possible ways to construct "good" sets, as shown in the example below. Before discussing this further, note that the following proposition shows that it is easy to find the closure of an arbitrary subset of R^p with respect to Condition A.

Proposition 5. For any M ⊆ R^p, the set
∪_{m∈M} ∪_{d∈{−1,1}^p} A^d_C(m)
is the smallest set containing M that satisfies Condition A.
We now provide an example for p = 2 illustrating the difference between the confidence ellipse based on the LS estimator and the one based on the Lasso, as well as how to choose a better shape in terms of volume for the confidence set based on the Lasso estimator. The simulations and calculations were carried out using the statistical software package R. The example is set up in the following way. We let n = 20 and generate the (n × 2)-matrix X using independent and identically distributed standard normal entries that are transformed row-wise by an appropriate (2 × 2)-matrix in order to obtain a prescribed matrix C_n. We generate the data vector y from the corresponding linear model with σ² = 1 (so that ε ∼ N(0, I_n)) and true parameter chosen as β = (1, 0)'. We compute the Lasso estimator using the glmnet package and tuning parameters λ_{n,1} = λ_{n,2} = √n/2 (asymptotically corresponding to what we will refer to as conservative model selection in the subsequent section). We also considered estimators where the tuning parameters were chosen by 10-fold cross-validation (as provided in the glmnet package), which ended up yielding comparable results for the estimator.
We then constructed confidence ellipses with level α = 0.05 based on both the LS and the Lasso estimator in the manner described earlier in this section. The resulting sets are shown in Figure 2. The plot clearly illustrates the fact described above that the confidence ellipse based on the Lasso estimator is larger than the confidence ellipse based on the LS estimator. Also, the two sets overlap by a large amount (in fact, the maximal distance between the two estimators is controlled by Proposition 13 in the Appendix). However, the LS ellipse is not entirely contained in the one based on the Lasso, stressing the fact that Theorem 1 yields non-trivial sets.
The above comparison between the two ellipses, however, is somewhat unfair in the sense that the shape used for both confidence sets is the optimal one (in terms of volume) for the LS estimator but, as discussed above, not for the Lasso estimator. With the optimal shape for a Lasso confidence set being unknown, we at least want to find a shape that improves upon the ellipse. As a basis for this, we consider the union of the contour sets corresponding to the distributions of û^d_n, that is, the 2^p shifted C_n-ellipses
U_n(k) = ∪_{d∈{−1,1}^p} (−n^{-1/2}C_n^{-1}Λ_n d + E_{C_n}(k)),
where each set in the union is of optimal shape for the corresponding distribution of û^d_n. As a starting point, we choose k so that P(û_LS ∈ E_{C_n}(k)) = 1 − α; note that k is then simply the parameter of the C_n-ellipse used for the LS estimator, but any k > 0 such that U_n(k) holds the prescribed minimal coverage could be used. Clearly, this set is still too large and will not satisfy Condition A, so we need to address these two issues. First, we add all points necessary so that the resulting set satisfies Condition A. Proposition 5 ensures that the set ∪_{m∈U_n(k)} ∪_{d∈{−1,1}^p} A^d_{C_n}(m) fulfills the desired condition. Note that in this particular case, it is fairly straightforward to see that this set is simply given by the convex hull of the shifted ellipses U_n(k). Finally, to get the smallest set with this shape that still holds the prescribed level of coverage, we iteratively adjust the set by reducing the parameter k and re-calculating the minimal coverage probability of the resulting set until the desired minimal coverage probability is reached (up to an arbitrary level of precision). The resulting alternatively shaped set is depicted in Figure 3, with (a) showing the midpoints of the 2^p = 4 ellipses used in the construction and (b) displaying the new confidence set on top of the elliptic confidence region based on the Lasso as devised before. It is obvious that the new shape has slightly less volume than the ellipse.
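For p = 2, the construction just described can be carried out numerically. The R sketch below is a rough illustration under the convex-hull description given above (it is not the paper's code; it uses the sp package for a point-in-polygon test, and the discretization and sample sizes are arbitrary): it builds the hull of the 2^p shifted ellipses for a given k and evaluates the minimal coverage by Monte Carlo, so that k can then be decreased, e.g. by bisection, until the target 1 − α is met.

# Convex hull of the 2^p shifted C_n-ellipses (p = 2) and its minimal coverage.
library(sp)

alt_shape_coverage <- function(k, Cn, lambda_n, n, sigma, Z, signs) {
  Cn_inv <- solve(Cn)
  centers <- t(apply(signs, 1, function(d) -Cn_inv %*% (lambda_n * d) / sqrt(n)))
  theta <- seq(0, 2 * pi, length.out = 200)
  A <- solve(chol(Cn))                       # maps the unit circle to {x : x'C_n x = 1}
  circ <- cbind(cos(theta), sin(theta)) %*% t(A) * sqrt(k)
  pts <- do.call(rbind, lapply(seq_len(nrow(centers)),
                               function(i) sweep(circ, 2, centers[i, ], "+")))
  hull <- pts[chull(pts), ]                  # vertices of the convex hull
  # minimal coverage over the sign vectors d, as in Theorem 1
  min(apply(signs, 1, function(d) {
    mu_d <- as.vector(-Cn_inv %*% (lambda_n * d) / sqrt(n))
    U <- sweep(Z, 2, mu_d, "+")
    mean(point.in.polygon(U[, 1], U[, 2], hull[, 1], hull[, 2]) > 0)
  }))
}

# usage sketch:
#   signs <- as.matrix(expand.grid(c(-1, 1), c(-1, 1)))
#   Z <- matrix(rnorm(2e5), ncol = 2) %*% chol(sigma^2 * solve(Cn))
#   decrease k from the LS value until alt_shape_coverage(...) is close to 1 - alpha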

Asymptotic Framework
We now derive asymptotic results that hold without assuming normality of the errors. In addition to the assumptions in Section 2, for all asymptotic considerations we assume that X = (x_1, . . ., x_n)' with x_i ∈ R^p, meaning that the regressor matrix X changes with n only by appending rows, and that C_n = X'X/n → C as n → ∞, where C is finite and positive definite. This setting ensures consistency and asymptotic normality of the LS estimator. We will consider two different regimes for the asymptotic behavior of the tuning parameters λ_n and start with the regime we call conservative tuning.

Conservative Tuning
In this regime and throughout this subsection, we require that λ_{n,j}/n^{1/2} → λ_j ∈ [0, ∞) for all j = 1, . . ., p as n → ∞. This implies that λ_{n,j}/n → 0 for all j = 1, . . ., p, which in turn implies consistency of β̂_L (see Theorem 1 in Knight & Fu (2000), with the slight modification that in our paper we allow for componentwise defined tuning parameters). We let λ = (λ_1, . . ., λ_p)' and Λ = diag(λ).
Remark 1. Such a choice of tuning parameters indeed yields a conservative model selection procedure in the sense that lim sup_{n→∞} sup_{β∈R^p} P_β(β̂_{L,j} = 0) < 1 for each j = 1, . . ., p. In particular, if β_j = 0, the probability that the Lasso estimator sets this component equal to zero does not converge to one. The latter statement was also noted by Zou (2006) in Proposition 1.
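The effect is easy to see in the simplest case of a single regressor with β = 0: under the objective as reconstructed in Section 2, β̂_L = 0 exactly when |x'y| ≤ λ_n, so with λ_n = λn^{1/2} the probability of excluding the (truly zero) coefficient stabilizes near 2Φ(λ/σ) − 1 < 1 instead of tending to one. The small R check below is illustrative only; the exact limiting constant depends on the penalty scaling used.

# p = 1, beta = 0: under conservative tuning lambda_n = lambda * sqrt(n),
# P(betahat_L = 0) = P(|x'eps| <= lambda_n) stays bounded away from one.
set.seed(2)
lambda <- 2; sigma <- 1; reps <- 2000
for (n in c(50, 500, 5000)) {
  zero <- replicate(reps, {
    x <- rnorm(n)
    eps <- sigma * rnorm(n)
    abs(sum(x * eps)) <= lambda * sqrt(n)    # betahat_L = 0 iff |x'y| <= lambda_n
  })
  cat("n =", n, " P(betahat_L = 0) ~", mean(zero), "\n")
}
cat("limiting value 2*pnorm(lambda/sigma) - 1 =", 2 * pnorm(lambda / sigma) - 1, "\n")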
The following proposition implicitly states the asymptotic distribution of the estimator in a so-called moving-parameter framework. This proposition is essentially Theorem 5 from Knight & Fu (2000) and can be proven in the same manner, simply by adjusting for componentwise tuning.
Note that the vector t takes over the role of n^{1/2}β in the finite-sample version of the function, Q_n, where the cases n^{1/2}β_j = ±∞ are now included in the asymptotic setting. Also, the assumption of n^{1/2}β_n converging in R̄^p is not a restriction in the sense that, by compactness of R̄^p, Proposition 6 characterizes all accumulation points of the distributions (with respect to weak convergence) corresponding to completely arbitrary sequences β_n.
Similarly to the finite-sample case, we define û to be the unique minimizer of Q and, for d ∈ {−1, 1}^p, we define Q^d(u) = u'Cu − 2u'W + 2 Σ_{j=1}^p λ_j d_j u_j with unique minimizer û^d. We can then formulate an asymptotic version of Theorem 1.
Given this result we can, again, construct asymptotically valid confidence sets for the parameter β in the following way.
We find that asymptotically in the case of conservative tuning, we essentially get the same results as in finite samples when assuming normally distributed errors.The only difference is that the minimal coverage holds asymptotically and that the quantities C n and n −1/2 Λ n have settled to their limiting values C and Λ, respectively.

Consistent Tuning
In the second regime and throughout this subsection, we suppose that λ_{n,j}/n^{1/2} → ∞ for at least one j with 1 ≤ j ≤ p, as well as λ_{n,j}/n → 0 for all j = 1, . . ., p as n → ∞, where the latter condition ensures estimation consistency of the estimator. We refer to this regime as consistent tuning to highlight the contrast to conservative tuning, where λ_{n,j}/n^{1/2} converges for each j = 1, . . ., p. Yet we emphasize that in order to ensure P_β(β̂_{L,j} = 0) → 1 whenever β_j = 0, we would need λ_{n,j}/n^{1/2} → ∞ for each j = 1, . . ., p, as well as additional conditions on the regressor matrix X. We refer the reader to Zou (2006), Zhao & Yu (2006) and Yuan & Lin (2007) for a discussion concerning necessary and sufficient conditions on X in this context.
In the case of consistent tuning, the rate of the estimator is no longer n^{-1/2}, neither when viewed in a fixed-parameter asymptotic framework (as has been noted by Zou (2006) in Lemma 3), nor (a fortiori) within a moving-parameter asymptotic framework, as discussed in Pötscher & Leeb (2009) in Theorem 2. The latter reference shows that the correct (uniform) convergence rate depends on the sequence of tuning parameters λ_n. Since we allow for componentwise tuning, the rate in fact depends on the largest component of the vector of tuning parameters, as can be seen from the following proposition. We define λ*_n = max_{1≤j≤p} λ_{n,j} and λ_0 = (λ_{0,1}, . . ., λ_{0,p})' by λ_{n,j}/λ*_n → λ_{0,j} ∈ [0, 1] for each j = 1, . . ., p as n → ∞. Note that λ_{0,j} = 1 for all j in case all components are equally tuned.
(In contrast to the finite-sample and the conservative case, we make the dependence of the objective function V_ζ on the unknown parameter ζ ∈ R^p apparent in the notation to clarify what we do in the following.) Proposition 9 shows that λ*_n/n is indeed the correct (uniform) convergence rate, as the limit of n(β̂_L − β)/λ*_n is not 0 in general. The proposition also reveals that in the consistently tuned case, when scaled according to the correct convergence rate, the limit of the sequence of estimators is always non-random, a fact that, in a moving-parameter asymptotic framework, has already been noted in the one-dimensional case in Pötscher & Leeb (2009). This fact allows us to construct very simple confidence sets in the case of consistent tuning by first observing that the limit of n(β̂_L − β)/λ*_n is always contained in a bounded set, which is described in Proposition 10. To this end, define the set
M = {arg min_{u∈R^p} V_ζ(u) : ζ ∈ R^p}    (4)
and note that the following can be shown.
Proposition 10. The set M can be written as
M = {m ∈ R^p : |(Cm)_j| ≤ λ_{0,j} for all j = 1, . . ., p} = C^{-1}{x ∈ R^p : |x_j| ≤ λ_{0,j} for all j = 1, . . ., p}.

Thus M can be viewed as a box distorted by the linear function C^{-1}, a bounded set in R^p. In fact, this turns out to be a parallelogram whose corner points are given by the set {C^{-1}Λ_0 d : d ∈ {−1, 1}^p}, where Λ_0 = diag(λ_0). Note that, fittingly, these corner points can be viewed as the equivalent of the means of the normal distributions determining the minimal coverage probability in the conservative case in Theorem 7, appearing without randomness in the limit in the consistently tuned case. Using Proposition 10, a simple asymptotic confidence set can now be constructed, as is done in the following corollary.
Corollary 11. We have lim_{n→∞} inf_{β∈R^p} P_β(β ∈ β̂_L − dλ*_n M/n) = 1 for every d > 1, whereas this limit equals 0 for every d < 1.

Note that nothing can be said about the boundary case d = 1. This corollary is a generalization of the simple confidence interval given in Proposition 6 in Pötscher & Schneider (2010). Finally, also note that the set M is not required to satisfy Condition A and, in fact, will not comply with this condition for certain matrices C.
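In practice, the confidence set of Corollary 11 is easy to handle because membership reduces to p componentwise inequalities. The R sketch below is an illustration only: the inflation factor d = 1.05 is an arbitrary choice of some d > 1, the function names are hypothetical, and in applications C and λ_0 would be replaced by finite-sample surrogates such as X'X/n and λ_n/λ*_n.

# Confidence set beta_hat_L - d * lambda_star_n * M / n with d > 1, where
# M = { m : |(C m)_j| <= lambda_0[j] for all j }.
in_conf_set <- function(b, beta_hat, C, lambda0, lambda_star_n, n, d = 1.05) {
  all(abs(C %*% (beta_hat - b)) <= d * lambda0 * lambda_star_n / n)
}

# corner points of the parallelogram-shaped region:
# beta_hat - d * lambda_star_n * C^{-1} Lambda_0 s / n for s in {-1,1}^p
conf_set_corners <- function(beta_hat, C, lambda0, lambda_star_n, n, d = 1.05) {
  p <- length(beta_hat)
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), p)))
  t(apply(signs, 1, function(s)
    beta_hat - d * lambda_star_n * solve(C, lambda0 * s) / n))
}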

Conclusion
We consider confidence regions based on the Lasso estimator covering the entire unknown parameter vector, thereby quantifying the estimation uncertainty of this estimator. We provide exact formulas for the minimal coverage probability of these regions in finite samples and asymptotically in a low-dimensional framework when the estimator is tuned to perform conservative model selection. We do this without explicit knowledge of the estimator's distribution, but by carefully exploiting the structure of the optimization problem that defines the estimator. The sets we consider as confidence regions need to satisfy certain shape constraints, which also apply to the regular confidence ellipse based on the LS estimator. We show that the LS confidence ellipse is always smaller than the one based on the Lasso estimator, but in general not contained in the Lasso ellipse. An ellipse is not the optimal shape for the confidence region based on the Lasso estimator in terms of volume. We give some guidelines on how to construct regions of smaller volume. We show how a set can be minimally enlarged in order to comply with the imposed shape condition, allowing the construction to start from sets of arbitrary shape.
In the consistently tuned case, we give a simple asymptotic confidence region in the shape of a parallelogram that is determined by the regressor matrix.
Proofs

We start the proof section by introducing some notation that will be used throughout this section. Let e_j denote the j-th unit vector in R^p and let ι = (1, . . ., 1)' ∈ R^p. For a vector d ∈ {−1, 1}^p, we define O^d to be the corresponding orthant of R^p, that is, O^d = {z ∈ R^p : d_j z_j ≥ 0 for all j}, and Ō^d to be the corresponding orthant of R̄^p, that is, Ō^d = {z ∈ R̄^p : d_j z_j ≥ 0 for all j}. By O^ι_int we denote the orthant with strictly positive components only, that is, O^ι_int = {z ∈ R^p : z_j > 0 for all j}. The sup-norm on R^p is denoted by ‖·‖_∞. To remind the reader of some notation relevant for the following proofs that was introduced previously throughout the paper, note that û_n = n^{1/2}(β̂_L − β), where û_n is the minimizer of Q_n, and û_LS = n^{1/2}(β̂_LS − β). The minimizer of Q^d_n was labeled û^d_n. The asymptotic versions in the conservatively tuned case were labeled û and Q, as well as û^d and Q^d, respectively.
The directional derivative of a function g : R^p → R at u in the direction of r ∈ R^p \ {0} is defined as lim_{t↓0} (g(u + tr) − g(u))/t.

Proofs for Section 3
In order to prove the main theorem, we start by re-writing Condition A. For m ∈ R^p and a p × p matrix C, we define the sets A_{C,j}(m) and B^ι_{C,j}(m) for j = 1, . . ., p, which describe the sets A^d_C(m) coordinate-wise; in fact, also the following lemma holds.
Proof. We fix m and C, drop the corresponding subscripts and show that the set on the left-hand side of the equation contains the set on the right-hand side. To this end, take any z from the set on the right-hand side. Then there exists a d ∈ {−1, 1}^p such that, for each j = 1, . . ., p, z is contained in one of the two sets corresponding to that coordinate. Then, by construction, z ∈ A^{f_j} for all j = 1, . . ., p and therefore z ∈ ∩_j A^{f_j}, so that z is contained in the set on the left-hand side of the equation.
Since it is needed later on, we also prove the following proposition, which quantifies the maximal distance between the Lasso and the LS estimator in finite samples.

Proposition 13. For all j = 1, . . ., p, we have |(C_n(β̂_L − β̂_LS))_j| ≤ λ_{n,j}/n and |(C_n(û_n − û_LS))_j| ≤ n^{-1/2}λ_{n,j}.
Proof. The two inequalities above just differ by a scaling factor. We show the latter one. We have C_n û_LS = W_n. Consider the directional derivatives of Q_n at its minimizer û_n in the directions e_j and −e_j. Their non-negativity yields (C_n û_n)_j − W_{n,j} ≥ −n^{-1/2}λ_{n,j} as well as (C_n û_n)_j − W_{n,j} ≤ n^{-1/2}λ_{n,j}. Piecing the two displays above together yields the second inequality in the proposition.
To proceed, note that Q^d_n as defined in (1) is a simple quadratic and strictly convex function in u with unique minimizer û^d_n given by
û^d_n = C_n^{-1}(W_n − n^{-1/2}Λ_n d),    (5)
where W_n ∼ N(0, σ²C_n). We first show Theorem 1 for one orthant of the parameter space R^p, as is formulated in Proposition 14.
In essence, Proposition 14 states Theorem 1 for the orthant of the parameter space where all components of β are non-negative. The condition in Proposition 14 takes the role of Condition A for the corresponding orthant, as will become apparent later on in the proof of Theorem 1.
Proof of Proposition 14. We first show that inf_{β∈O^ι} P_β(û_n ∈ M_n) ≥ P(û^ι_n ∈ M_n) by showing that, for each fixed ω ∈ Ω, û^ι_n ∈ M_n implies that û_n ∈ M_n as long as β_j ≥ 0 for all j. For this, we first show the following two facts.
(a) (C_n û^ι_n)_j ≤ (C_n û_n)_j for all j = 1, . . ., p. Suppose there exists a j_0 such that (C_n û^ι_n)_{j_0} > (C_n û_n)_{j_0} and note that by (5) we have (C_n û^ι_n)_j = W_{n,j} − n^{-1/2}λ_{n,j} for each j = 1, . . ., p. Now consider the directional derivative of Q_n at its minimizer û_n in direction e_{j_0}; it is then seen to be negative, which is a contradiction to û_n minimizing Q_n.
Essentially, we have now shown the main theorem for one part of the parameter space R^p. By flipping signs, we can apply Proposition 14 to each orthant O^d, thus obtaining the formula for the infimal coverage probability over the whole space.
Proof of Theorem 1. First note that the infimum of the coverage probability over β ∈ R^p can be computed as the minimum over d ∈ {−1, 1}^p of the infima over the orthants O^d. Thus, if we can show that inf_{β∈O^d} P_β(û_n ∈ M_n) = P(û^d_n ∈ M_n) for each d ∈ {−1, 1}^p, the proof is done. Now, fix d and set D = diag(d). We consider the function Q̃_n(u) = Q_n(Du) with C̃_n = DC_nD and W̃_n = DW_n ∼ N(0, σ²C̃_n). We write ũ_n for the minimizer of Q̃_n and, analogously to Section 3, we define ũ^ι_n to be the minimizer of the function u'C̃_n u − 2u'W̃_n + 2n^{-1/2} Σ_{j=1}^p λ_{n,j} u_j. If we can show that the set DM_n satisfies the requirement of Proposition 14 with the matrix C̃_n in place of C_n, we may conclude that inf_{β∈O^ι} P_β(ũ_n ∈ DM_n) = P(ũ^ι_n ∈ DM_n). Note that û_n = Dũ_n, û^d_n = Dũ^ι_n and D^{-1} = D, so that inf_{β∈O^d} P_β(û_n ∈ M_n) = P(û^d_n ∈ M_n), which proves the formula for the infimal coverage probability. We now show that the set DM_n indeed satisfies the requirement of Proposition 14 with the matrix C̃_n. By Facts (a) and (b) we clearly have that ũ_n ∈ A_{C̃_n,j}(Dm) ∪ B^ι_{C̃_n,j}(Dm) ⊆ DM_n for all m ∈ M_n. A straightforward calculation shows that this is equivalent to A^d_{C_n}(m) ⊆ M_n for each m ∈ M_n, which clearly holds by Condition A and Proposition 12. The distributional result on û^d_n immediately follows by (5).

Proofs for Section 4

Proof of Proposition 3. Let m ∈ E_{C_n}(k) and y ∈ A^d_{C_n}(m). We show that y ∈ E_{C_n}(k). Remember that D = diag(d) satisfies DD = I_p. Since y ∈ A^d_{C_n}(m), we have −Dy ∈ O^ι and −DC_n(m − y) ∈ O^ι, implying that y'C_n(m − y) = (Dy)'DC_n(m − y) ≥ 0. Furthermore, since (m − y)'C_n(m − y) ≥ 0, we have m'C_n(m − y) ≥ y'C_n(m − y) ≥ 0, which in turn yields m'C_n m ≥ m'C_n y ≥ y'C_n y ≥ 0. But this means that k ≥ m'C_n m ≥ m'C_n y ≥ y'C_n y and therefore y ∈ E_{C_n}(k).

Proof of Proposition 4. We transform the ellipse to a sphere and the corresponding normal distributions to have independent components with equal variances; the transformed mean corresponding to û^d_n then has squared norm d'Λ_n C_n^{-1}Λ_n d/(nσ²). So clearly, the smallest probability will be achieved for the distribution with mean furthest away from the origin, which is any d* maximizing this quantity.

Proof of Proposition 5. We start by showing that for any m ∈ R^p and d ∈ {−1, 1}^p, we have
A^d_C(y) ⊆ A^d_C(m) for every y ∈ A^d_C(m).    (6)
Let z ∈ A^d_C(y). Then d_j z_j ≤ 0 and d_j(Cy)_j ≤ d_j(Cz)_j for all j. But since y ∈ A^d_C(m), we also have d_j(Cm)_j ≤ d_j(Cy)_j for all j, so that d_j(Cm)_j ≤ d_j(Cz)_j for all j and therefore z ∈ A^d_C(m), thus proving (6). So clearly, the set displayed in Proposition 5 satisfies Condition A. For each m ∈ M, choose d ∈ {−1, 1}^p in such a way that d_j = 1 if m_j = 0 and d_j = −sgn(m_j) for m_j ≠ 0. We then get m ∈ A^d_C(m), implying that the set in the display above indeed contains M.
Proofs for Section 5

Since λ_n/n^{1/2} converges, we have B_n ⊆ C_n^{-1}B_δ with B_δ = {x ∈ R^p : ‖x‖_∞ ≤ δ} for some δ > 0. Since C_n^{-1} → C^{-1}, the set {C_n^{-1} : n ∈ N} is bounded in operator sup-norm by Banach-Steinhaus, so that the set B_n is uniformly bounded over n in sup-norm by, say, γ > 0. We now fix a component j and show that lim inf_{n→∞} inf_{β∈R^p} P_β(β̂_{L,j} = 0) > 0. To this end, define R_j = R^{j−1} × {0} × R^{p−j}. Let ξ²_{j,n} and ξ²_j be the positive j-th diagonal elements of C_n^{-1} and C^{-1}, respectively. Observe that inf_{β∈R^p} P_β(β̂_{L,j} = 0) ≥ inf_{β∈R^p} P_β(β̂_LS − ···). Since V_n and V_ζ are strictly convex and V_ζ is non-random, it follows by Geyer (1996) that also the corresponding minimizers converge in probability to the minimizer of the limiting function.
Proof of Proposition 10. The equality of the two sets given in the display of Proposition 10 is trivial. We show that the set M as defined in (4) is equal to the set on the left-hand side and start by proving that M is contained in that set. Take any m ∈ M; by definition, there exists a ζ ∈ R^p so that m is the minimizer of V_ζ. We need to show that |(Cm)_j| ≤ λ_{0,j} for all j. Assume that |(Cm)_{j_0}| > λ_{0,j_0} for some 1 ≤ j_0 ≤ p. If (Cm)_{j_0} > λ_{0,j_0}, we consider the directional derivative of V_ζ at its minimizer m in the direction of −e_{j_0} to arrive at a negative value, a contradiction. Since the directional derivative is non-negative in any direction r ∈ R^p \ {0} and V_ζ is (strictly) convex, m must be the minimizer.
Proof of Corollary 11. We start with the case d > 1. Let c = lim inf_{n→∞} inf_{β∈R^p} P_β(β ∈ β̂_L − dλ*_n M/n). By definition, there exists a subsequence n_k and elements β_{n_k} ∈ R^p such that P_{β_{n_k}}(β_{n_k} ∈ β̂_L − dλ*_{n_k} M/n_k) → c as k → ∞. Note that dM = {m ∈ R^p : |(Cm)_j| ≤ dλ_{0,j}, 1 ≤ j ≤ p}. Now, pick a further subsequence n_{k_l} such that λ*_{n_{k_l}}β_{n_{k_l}}/n_{k_l} converges in R^p to, say, ζ. Proposition 9 then shows that n_{k_l}(β̂_L − β_{n_{k_l}})/λ*_{n_{k_l}} converges in probability to the unique minimizer of V_ζ as l → ∞. Finally, Proposition 10 implies that c = 1. We next look at the case where d < 1. Let m = C^{-1}λ_0, so that m ∈ M \ dM. From the proof of Proposition 10, we know that for ζ = −m we have m = arg min_{u∈R^p} V_ζ(u). Let β_n = nζ/λ*_n. By Proposition 9, n(β̂_L − β_n)/λ*_n converges to m in P_{β_n}-probability, so that P_{β_n}(n(β̂_L − β_n)/λ*_n ∈ dM) → 0.

Figure 3: (a) Construction of the alternative shape based on the 2^p = 4 shifted ellipses, with their centers displayed as dots. (b) The resulting improved confidence set with the alternative shape (blue) and the previous elliptic shape (red), both centered at the Lasso estimator β̂_L = (1.15, 0)'.
