On the Distribution, Model Selection Properties and Uniqueness of the Lasso Estimator in Low and High Dimensions

We derive expressions for the finite-sample distribution of the Lasso estimator in the context of a linear regression model in low as well as in high dimensions by exploiting the structure of the optimization problem defining the estimator. In low dimensions, we assume full rank of the regressor matrix and present expressions for the cumulative distribution function as well as the densities of the absolutely continuous parts of the estimator. Our results are presented for the case of normally distributed errors, but do not hinge on this assumption and can easily be generalized. Additionally, we establish an explicit formula for the correspondence between the Lasso and the least-squares estimator. We derive analogous results for the distribution in less explicit form in high dimensions where we make no assumptions on the regressor matrix at all. In this setting, we also investigate the model selection properties of the Lasso and show that possibly only a subset of models might be selected by the estimator, completely independently of the observed response vector. Finally, we present a condition for uniqueness of the estimator that is necessary as well as sufficient.


Introduction
The distribution of the Lasso estimator (Tibshirani, 1996) has been an object of study in the statistics literature for a number of years. The often cited paper by Knight & Fu (2000) gives the asymptotic distribution of the Lasso in the framework of conservative model selection in a low-dimensional (fixed-p) framework by listing the limit of the corresponding stochastic optimization. Pötscher & Leeb (2009) derive explicit expressions of the distribution in finite samples as well as asymptotically for all large-sample regimes of the tuning parameter ("conservative" as well as "consistent model selection") in the framework of orthogonal regressors. More recently, Zhou (2014) gives high-level information on the finite-sample distribution for arbitrary designs in low and high dimensions, geared towards setting up a Monte-Carlo approach to infer about the distribution. In Ewald & Schneider (2018), the large-sample distribution of the Lasso is derived in a low-dimensional framework for the large-sample regime of the tuning parameter not considered in Knight & Fu (2000). Moreover, Jagannath & Upadhye (2018) consider the characteristic function of the Lasso to obtain approximate expressions for the marginal distribution of one-dimensional components of the Lasso when these components are "large", therefore not having to consider the atomic part of the estimator.
In this paper, we exactly and completely characterize the distribution of the Lasso estimator in finite samples in the context of a linear regression model with normal errors. In low dimensions, we give formulae for the cumulative distribution function (cdf), as well as the density functions conditional on which components of the estimator are non-zero. We do so assuming full column rank of the regressor matrix. Our results do not hinge on the normality assumption of the errors, but can easily be extended to more general error distributions. We also exactly quantify the correspondence between the Lasso and least-squares (LS) estimator, depending on the regressor matrix and tuning parameters only.
In a high-dimensional setting, we make absolutely no assumptions on the regressor matrix. We give formulae for the probability of the Lasso estimator falling into a given set and exactly quantify the relationship between the Lasso estimator and the data object X'y. Through this relationship, we also learn that the Lasso may never select certain models, this property depending only on the regressor matrix and the penalization weights and being independent of the observed response vector. In fact, we can characterize a so-called structural set that contains all covariates that are part of a Lasso model for some response vector. This structural set can be identified by how the row space of the regressor matrix intersects a cube centered at the origin whose side lengths are determined by the penalization weights. The set may not contain all indices, in which case the Lasso estimator will rule out certain covariates for all possible observations of the dependent variable. This is related to the idea of so-called SAFE rules that can discard covariates for Lasso solutions for a fixed value of the dependent variable.
Finally, we present a condition for uniqueness of the Lasso estimator that is both necessary and sufficient, again related to how the row space of the regressor matrix intersects the above-mentioned cube. Previously, only a sufficient condition for uniqueness has been known (see e.g. Tibshirani, 2013; Ali & Tibshirani, 2019). The results quantifying the relationship between the Lasso and the LS estimator or X'y, respectively, are in fact completely independent of the error distribution and merely utilize the given values of the dependent variable and the regressor matrix. The results on model selection properties and uniqueness use the regressor matrix and penalization weights only.
The paper is organized as follows. We introduce the setting and notation in Section 2 and state basic results used throughout the paper. The low-dimensional case is treated in Section 3, whereas we consider the high-dimensional case in Section 4. We conclude in Section 5.

Consider the linear model

y = Xβ + ε,    (1)

where y is the observed n×1 data vector, X is the n×p regressor matrix, which is assumed to be nonstochastic, β ∈ R^p is the true parameter vector and ε is the unobserved error term. We assume that X'ε follows a N(0, σ²X'X)-distribution with σ² > 0, which, for instance, is the case when the components of ε are independent and identically distributed according to a N(0, σ²)-distribution. Our results depend on the distribution of X'ε only, and we chose the normal distribution for presentation purposes. We consider the weighted Lasso estimator β̂_L, defined as a solution to the minimization problem

min_{β ∈ R^p} L(β) = ‖y − Xβ‖² + 2 Σ_{j=1}^p λ_{n,j} |β_j|,    (2)

where λ_{n,j} are non-negative user-specified tuning parameters that will typically depend on n. To ease notation, we shall suppress this dependence for the most part and write λ_{n,j} = λ_j for each j. Note that if λ_j = 0 for all j, the weighted Lasso is equal to LS estimation, and that λ_1 = · · · = λ_p > 0 corresponds to the classical Lasso estimator as proposed by Tibshirani (1996), a case we also refer to as uniform tuning. For later use, let λ = (λ_1, . . . , λ_p)' and define M_0 = {j : λ_j = 0}, the index set of all unpenalized coefficients. If M_0 ≠ ∅, we speak of partial tuning. Note that M_0 contains the indices of covariates that will be part of any model chosen by the Lasso. We stress dependence on the unknown parameter β when it occurs, but do not specify dependence on X, y or λ as these quantities are available to the user.
The following notation will be used throughout the paper. Let φ_(µ,Σ) denote the Lebesgue density of a normally distributed random variable with mean µ and covariance matrix Σ, and let Φ be the cdf of a univariate standard normal distribution. For a vector m ∈ R^p and an index set I ⊆ {1, . . . , p}, the vector m_I ∈ R^|I| contains only the components of m corresponding to the elements of I. We write |I| for the cardinality of I, and I^c for {1, . . . , p} \ I, the complement of I. The 1-norm of m is denoted by ‖m‖_1, whereas the 2-norm is simply denoted by ‖m‖. For x ∈ R, let sgn(x) = 1_{x>0} − 1_{x<0}, where 1 is the indicator function. For a set A ⊆ R^p, the set m + A = A + m is defined as {m + z : z ∈ A}, with analogous definitions for A − m and m − A.
We denote the Cartesian product by ∏ and the kernel, column space, and rank of a matrix C by ker(C), col(C), and rk(C), respectively. The columns of C are denoted by C_j, whereas C_I, for some index set I, is the matrix containing the |I| columns of C corresponding to indices in I only. For a square matrix C, |C| denotes the determinant of C. We use R_{>0} for the positive, and R_{≥0} for the non-negative real numbers.
Let {D_−, D_0, D_+} be a partition of {1, . . . , p} into three sets, some of which may be empty. It will be convenient to also describe this partition by a vector d ∈ {−1, 0, 1}^p with d_j = −1 for j ∈ D_−, d_j = 0 for j ∈ D_0 and d_j = 1 for j ∈ D_+.

Finally, we state the conditions that characterize solutions to (2), known as the Karush-Kuhn-Tucker (KKT) conditions for the Lasso (see e.g. Tibshirani, 2013, with the slight adaptation that we use componentwise tuning), which form the basis of most proofs and will be used throughout the article.
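For orientation, the KKT conditions can be written out as follows. This is a sketch under the assumption that the objective in (2) has the form L(β) = ‖y − Xβ‖² + 2 Σ_j λ_j |β_j|; this scaling is an assumption made here because it turns the thresholds into exactly λ_j, and X_j denotes the j-th column of X as in the notation above.

```latex
b \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} L(\beta)
\quad\Longleftrightarrow\quad
\begin{cases}
X_j'(y - Xb) = \lambda_j \operatorname{sgn}(b_j) & \text{if } b_j \neq 0,\\[2pt]
\lvert X_j'(y - Xb) \rvert \le \lambda_j & \text{if } b_j = 0,
\end{cases}
\qquad j = 1, \dots, p.
```

Both cases only involve the p-vector X'(y − Xb), which is why the later results can be phrased in terms of X'y or, in low dimensions, the LS estimator.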
Plugging in y = Xβ + u, we rewrite this as

The Low-dimensional Case
Throughout this section, we assume that X has full column rank p, implying that we are considering the low-dimensional setting where p ≤ n. This assumption is used in the following through the fact that W = X'ε follows a non-degenerate normal distribution on R^p in the distributional statements on β̂_L (Theorem 3, Corollary 5). It is additionally used through the fact that the true parameter is properly identified in the distributional statements concerning the estimation error β̂_L − β (Corollary 4, Proposition 6, Theorem 7). We also rely on this assumption in Theorem 8 through the invertibility of X'X and the existence of the LS estimator.
where m β and s λ ∈ R p are given by (m Proof. We need to compute the probability of the event By Corollary 2, this event is equivalent to the event {W ∈ A β (B z )} with A β (B z ) = ∪ b∈Bz A β (b) and A β (b) as defined in Corollary 2. As W = X ε ∼ N (0, σ 2 X X), the probability we are looking for is therefore given by We now look at the structure of the set A β (B z ) more concretely. Since sgn(b) = sgn(z) for all b ∈ B z , we get which, after applying the substitution w = X Xm β + s λ , yields the claim.
Theorem 3 gives the distribution of β̂_L. The dependence on the unknown parameter β arises in the shift (m_β)_{D_0} = −β_{D_0} as well as in the limits for the variables of integration m_{D_+} and m_{D_−}. In case the regressors are orthogonal, more concretely, if X'X = I_p, the probability expression in Theorem 3 simplifies to a product over the individual components, which is consistent with the well-known fact that the Lasso is equivalent to componentwise soft-thresholding in this case, also treated in Pötscher & Schneider (2009).
The distribution of the estimation error û = β̂_L − β can now be derived from Theorem 3 as a corollary.
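The soft-thresholding correspondence in the orthogonal case can be sketched in a few lines. The sketch assumes X'X = I_p and the objective ‖y − Xβ‖² + 2 Σ_j λ_j |β_j|, so that each Lasso component is the LS component moved towards zero by λ_j; the function names are hypothetical and not taken from the paper.

```python
def soft_threshold(z, lam):
    """Soft-threshold the scalar z at level lam >= 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthogonal(beta_ls, lams):
    """Weighted Lasso for X'X = I: componentwise soft-thresholding of the LS estimate."""
    return [soft_threshold(z, lam) for z, lam in zip(beta_ls, lams)]


# LS estimate (3, -0.5, -2) with weights (1, 1, 0.5):
# the second component is set to zero, the others move towards zero.
example = lasso_orthogonal([3.0, -0.5, -2.0], [1.0, 1.0, 0.5])
```

Components with |β̂_LS,j| ≤ λ_j are set exactly to zero, which is the source of the atomic part of the distribution discussed in this section.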
Another direct consequence of Theorem 3 is a concrete formula for the probability of the extreme event of the Lasso setting all components equal to zero.
Remark 1. To illustrate the structure behind the proof of Theorem 3, note that the equivalence of the events {β̂_L ∈ B_z} and {W ∈ A_β(B_z)} is shown through Corollary 2. The equivalence holds due to the structure of the optimization problem defining β̂_L and does not depend on the distribution of W = X'ε. In this sense, the distributional results do not hinge on the normality assumption of the errors and can easily be generalized to other error distributions. The relationship and shape of the sets B_z and A_β(B_z) is illustrated in Figure 1. Note that A_β depends on λ, whereas B_z does not.

Remark 2. Theorem 3, Corollary 4, Corollary 5, Proposition 6 and Theorem 7 do not rely on the normal distribution, as just mentioned. Indeed, the results equally hold for any other absolutely continuous distribution of X'ε (with respect to Lebesgue measure); only the expression φ_(0,σ²X'X) would have to be replaced by the corresponding density function of X'ε. Moreover, the results also hold for discrete X'ε, in which case the integral would have to be replaced by a sum, and the density function by the corresponding probability mass function.
Theorem 3 now puts us into a position to fully specify the distribution of the Lasso estimator. In case λ j > 0 for all j, one easily sees from the preceding corollary that this distribution is not absolutely continuous with respect to the p-dimensional Lebesgue-measure, and thus no density exists. One can, however, represent the distribution through Lebesgue-densities after conditioning on which components of the estimator are negative, equal to zero, and positive, which we shall do in the sequel.
can be calculated using Corollary 3.
Proof. Observe that and note that by Corollary 4, for any z ∈ R p with z + β ∈ O d we have Differentiating with respect to z j with j ∈ D ± , and taking the absolute value gives the density, thus completing the proof.
Besides the conditional densities, we can also specify the full cdf of û = β̂_L − β, which is done in the following theorem.
Theorem 7. The cdf ofû =β L − β is given by Proof. It is easily seen that Plugging in the formula for f d completes the proof.
For illustration of Proposition 6 and Theorem 7, consider Figures 2 and 3, which display an example of the distribution of û = β̂_L − β. One can see that the Lasso estimation error follows a shifted normal distribution, conditional on the event û_j ≠ −β_j (β̂_L,j ≠ 0) for each j, with the shift depending on the signs of β̂_L, as can be seen in Figure 2. Figure 3 displays the mass which lies on the set {z ∈ R^2 : z_1 = −β_1, z_2 = 0}, that is, the density functions h_(0,1) and h_(0,−1) on their corresponding domains. The mass on the set {z ∈ R^2 : z_1 = 0, z_2 = −β_2} looks qualitatively similar to Figure 3. Note that we also have point-mass at −β, as is pointed out by Corollary 5.

Shrinkage Areas
Using the conditions for minimality from Lemma 1, we can establish a direct relationship between the LS and the Lasso estimator in the following sense. For any b ∈ R^p, there exists a set S(b) ⊆ R^p such that the Lasso estimator assumes the value b if and only if the LS estimator lies in S(b). We refer to the set S(b) as shrinkage area since the Lasso estimator can be viewed as a procedure that shrinks the LS estimates from the set S(b) to the point b. Note that by shrinkage, we mean that ‖b‖_1 ≤ ‖z‖_1 for each z ∈ S(b), but |b_j| > |z_j| could hold for certain components. The explicit form of S(b) is formalized in the following theorem.
Clearly, the sets S(b) are disjoint for different b's.
The sets are disjoint since, in case rk(X) = p, all Lasso solutions are unique. If S(b) ∩ S(b̃) ≠ ∅ holds for some b ≠ b̃, we can find y ∈ R^n such that β̂_LS = (X'X)^{-1}X'y ∈ S(b) ∩ S(b̃), implying that both b and b̃ are Lasso solutions for the given y, yielding a contradiction.
where λ̃_j = sgn(b_j)λ_j for j = 1, . . . , p. This implies that, in case β̂_L,j ≠ 0 for all j, the Lasso estimator is given by β̂_L = β̂_LS − (X'X)^{-1}λ̃, with λ̃ formed from the signs of β̂_L. Note that aside from b, S(b) depends on X and λ only.
Given Theorem 8, we can identify areas in which components of the LS estimator are shrunk to zero by the Lasso. For p = 2, it leads to the image displayed in Figure 4. Clearly, the shrinkage areas are related to the polyhedral selection areas employed, for instance, in Lee et al. (2016), but yield a different kind of information. Our results identify the regions of the LS estimator that lead to a particular value b of the Lasso estimator. The polyhedral regions in the above articles identify the regions of the dependent variable y that correspond to a particular Lasso model with specific signs of the active coefficients. Naturally, our regions are subsets of R^p, while the polyhedral regions are subsets of R^n (the latter also allow the Lasso fit to be interpreted as a projection and do not depend on full column rank).
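A membership test for a shrinkage area can be sketched as follows. The explicit form of S(b) is not reproduced here, so the sketch works directly from the KKT conditions under the assumed objective ‖y − Xβ‖² + 2 Σ_j λ_j |β_j|: since X'y = X'X β̂_LS, the LS value z leads to the Lasso value b exactly when the j-th entry of X'X(z − b) equals sgn(b_j)λ_j for b_j ≠ 0 and lies in [−λ_j, λ_j] for b_j = 0. The function name, the tolerance and the Gram matrix used below are illustrative assumptions.

```python
def in_shrinkage_area(z, b, gram, lam, tol=1e-12):
    """Check whether the LS value z is mapped to the Lasso value b.

    gram is the p x p matrix X'X (list of rows); lam holds the weights.
    """
    p = len(b)
    diff = [zi - bi for zi, bi in zip(z, b)]
    # g = X'X (z - b), the vector that must satisfy the KKT conditions at b
    g = [sum(gram[j][k] * diff[k] for k in range(p)) for j in range(p)]
    for j in range(p):
        if b[j] > 0:
            ok = abs(g[j] - lam[j]) <= tol      # active, positive sign
        elif b[j] < 0:
            ok = abs(g[j] + lam[j]) <= tol      # active, negative sign
        else:
            ok = abs(g[j]) <= lam[j] + tol      # inactive component
        if not ok:
            return False
    return True


gram = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical X'X with correlated regressors
lam = [1.0, 1.0]
shrunk_to_axis = in_shrinkage_area([1.5, 0.0], [1.0, 0.0], gram, lam)
shrunk_to_zero = in_shrinkage_area([0.3, 0.3], [0.0, 0.0], gram, lam)
```

With correlated regressors the area S(b) for b = (1, 0)' also contains the point (1.75, −0.5)', so the second LS component is moved to zero while the first is shrunk from 1.75 to 1.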

High-Dimensional Case
We now turn to the main case of this article, the high-dimensional setting where p > n. We make no assumptions on the regressor matrix X in this section. Using similar arguments as in the case p ≤ n, we can again start by characterizing the distribution of the Lasso, albeit in a somewhat less explicit form. Note that we have rk(X) < p and that the true parameter is not identified without further assumptions. We denote by B_0 the set of all β ∈ R^p that yield the model given in (1), that is, B_0 = {β ∈ R^p : Xβ = E(y) = µ}. Furthermore, it is important to note that the Lasso solution need not be unique anymore. We give necessary and sufficient conditions for uniqueness later in Section 4.3. All findings in this section also hold when p ≤ n, but more explicit results for this case are found in Section 3. We start with a high-level result on the distribution of β̂_L, which immediately follows from Corollary 2.
Theorem 9. For any set B ⊆ R p and any β ∈ B 0 , we have In particular, the distribution of the estimatorβ L does not depend on the choice of β ∈ B 0 .
To derive the analogue of the distribution of the estimation error in high dimensions for a fixed β ∈ B_0, define the function V_β(u) = L(u + β) − L(β), which is minimized at β̂_L − β, where β̂_L may be any minimizer of L. The high-dimensional version of Corollary 4 can now be formulated as follows.

Corollary 10. For any set M ⊆ R^p and any β ∈ B_0, we have where W ∼ N(0, σ²X'X) and Ā_β(M) = ∪_{m∈M} Ā_β(m) with Ā_β(m) = X'Xm + ∏_{j=1}^p B̄_β,j(m_j) and

Proof. As stated above, m ∈ R^p is a minimizer of V_β if and only if m + β is a minimizer of L. Corollary 2 then yields m ∈ arg min

Remark 4. Note that, just as for the low-dimensional case discussed in Remark 2, the statements in Theorem 9 and Corollary 10 do not hinge on the normal distribution of W = X'ε. In fact, both results equally hold for arbitrary distributions of W.
While the distribution of β̂_L − β depends on the choice of β ∈ B_0, the distribution of β̂_L does not, as it is determined by y ∼ N(µ, σ²I_n). This is further formalized in the following corollary. As mentioned before, β̂_L need not be unique. Also remember that β̂_L itself minimizes the function L(β) defined in (2).
As the random variable W = X'ε has a singular covariance matrix, some care needs to be taken when computing the probability from Corollary 10 through the appropriate integral of the corresponding density function.
Corollary 11. Let the columns of U form a basis of col(X'). The probability that a Lasso solution lies in the set B ⊆ R^p can be written as

Proof. Note that U'W ∼ N(0, σ²U'X'XU) and that U'X'XU is invertible. Let N be a matrix whose columns form a basis of col(X')^⊥, so that N'W has covariance matrix σ²N'X'XN = 0, yielding N'W = 0 almost surely. We therefore have which proves the claim.

Selection Regions and Model Selection Properties
In the low-dimensional case, Theorem 8 gives what we call shrinkage areas of the Lasso with respect to the LS estimator. As the latter is never uniquely defined in the high-dimensional case, we instead look at the object X'y and consider so-called selection regions with respect to this quantity: for any b ∈ R^p, we provide a set T(b) such that a Lasso solution is equal to b if and only if X'y lies in the set T(b). The corresponding result turns out to be a restatement of Lemma 1, which we list again in the following for the sake of completeness.
Theorem 12. For each b ∈ R^p there exists a set T(b) ⊆ R^p such that b ∈ arg min_{β∈R^p} L(β) ⇐⇒ X'y ∈ T(b).

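Moreover, T(b) is given by T(b) = X'Xb + ∏_{j=1}^p B_j(b_j), where B_j(b_j) = {sgn(b_j)λ_j} if b_j ≠ 0 and B_j(b_j) = [−λ_j, λ_j] if b_j = 0.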
Remark 5. Analogously to the low-dimensional case, the sets T(b) are singletons if b ∈ R^p satisfies b_j ≠ 0 for all j = 1, . . . , p: in that case, T(b) = {X'Xb + λ̃}, where λ̃_j = sgn(b_j)λ_j for j = 1, . . . , p. Also, aside from b, the sets T(b) depend on X and λ only.
Inspecting the sets T(b) from Theorem 12 more closely, we see that they are, in general, not disjoint for different values of b ∈ R^p. This illustrates the fact that, in contrast to the low-dimensional case, the Lasso solution need not be unique in high dimensions anymore. Indeed, we can have T(b) ∩ T(b̃) ≠ ∅, as long as b − b̃ ∈ ker(X) and {sgn(b_j), sgn(b̃_j)} ≠ {−1, 1} for all j. This also makes apparent that b and b̃ may be Lasso solutions not corresponding to the same model, which has been noted by Tibshirani (2013) for the case of λ_1 = · · · = λ_p > 0. We go deeper into the issue of (non-)uniqueness in Section 4.3.
Theorem 12 also sheds some light on which models M ⊆ {1, . . . , p} may in fact be chosen by the Lasso estimator, where the Lasso model is given by {j : β̂_L,j ≠ 0}. We find that some models will, in fact, never be selected by the Lasso. This is illustrated in Figure 5 below, where the Lasso always sets the first component to zero, independently of y. This leads to the question of how to determine whether a particular model M may or may not be chosen.

Figure 5: The selection regions with respect to X'y from Theorem 12, with X = (1, 2) and λ_1 = λ_2 = 1 from Example 1. Displayed in red is col(X'), the area on which the probability mass of X'y is concentrated. The set T((0, 0)') is displayed in blue, while the parallel light gray lines represent the sets T((0, b_2)') with b_2 ≠ 0, and the parallel dark gray lines are the sets T((b_1, 0)') with b_1 ≠ 0.

Looking at the definition of T(b) in Theorem 12, and noting that X'Xb ∈ col(X'), we can deduce the following corollary.
Corollary 13. Let X ∈ R^{n×p} and λ ∈ R^p_{≥0} be given. There exist y ∈ R^n such that a corresponding Lasso solution selects model M ⊆ {1, . . . , p} if and only if col(X') ∩ B_M ≠ ∅, where B_M = {s ∈ R^p : |s_j| = λ_j for j ∈ M, |s_j| ≤ λ_j for j ∉ M}.

A model that may be chosen by the Lasso is called accessible in Sepehri & Harris (2017). (This reference also provides a condition for when this is the case. The difference is that it uses geometric considerations in R^n under a uniqueness assumption, whereas our approach operates in R^p with no assumptions on X.) The sets B_M are made up of the faces of the λ-cube. If M_0 = ∅, B_∅ is the p-dimensional λ-box, B_{j} is the union of two opposite facets of the λ-box, and for 1 < |M| < p, B_M is a union of (p − |M|)-dimensional faces of the λ-box. Finally, B_{1,...,p} simply contains the corners of the λ-box. These sets are illustrated in Figure 6 below. Note that col(X') ∩ B_{M_0} always contains the origin, so that, not surprisingly, there always exist y such that the non-penalized components will be part of the model chosen by the Lasso solution.
Example 2. To look at a more complex example, suppose now that X has the rows (2, 0, 1) and (1, 2, 0), so that n = 2, p = 3. Let λ_1 = λ_2 = λ_3 = λ̄ (uniform tuning). We have col(X') ∩ B_{3} = ∅ for all λ̄ > 0, so that by Corollary 13, no model containing the third regressor is selected, whatever the value of y. To say something about the distribution of the remaining components, note that the estimator is equivalent to the low-dimensional procedure using the matrix X̃ = X_{1,2}, which contains the first and second regressor only. Let β̃ ∈ R^2 be such that X̃β̃ = Xβ, where Xβ = E(y), and let where λ̃ will be specified below. We can now use Theorem 3 to find the following. The absolutely continuous parts of the distribution of (β̂_L,1, β̂_L,2) can be determined by for z_1, z_2 < 0 and λ̃ = (−λ̄, −λ̄)'. Analogously, we get for z_1, z_2 > 0 and λ̃ = (λ̄, λ̄)'. Moreover, for z_1 < 0, z_2 > 0 and λ̃ = (−λ̄, λ̄)', as well as for z_1 > 0, z_2 < 0 and λ̃ = (λ̄, −λ̄)'. This shows that the absolutely continuous parts of the estimator follow a normal distribution with the same covariance matrix as the LS estimator and a shift in expectation that depends on the regressor matrix as well as the tuning parameters. These findings are in line with Remark 3.
In both examples above, the distribution of β̂_L is the same as the one of a Lasso estimator in a smaller model. This fact is, of course, only valid for the specific forms of X and λ considered here. The models considered by the Lasso do not depend on β and ε in the sense that certain values of X and λ may immediately rule out certain models, completely independently of y. (The choice between the accessible models does, of course, very much depend on β and ε.) Examples 1 and 2 suggest that in the high-dimensional setting, model selection by the Lasso estimator may not be a purely data-driven procedure, insofar as there is a structural model or structural set M ⊆ {1, . . . , p}, determined by X and λ only, that satisfies β̂_L,j = 0 for any j ∉ M for all observations y ∈ R^n. In particular, the true parameter β ∈ B_0 as well as the distribution of ε do not have any influence on this set. In other words, some models are never considered by the model selection procedure, completely independently of the data vector y. Put differently again, for a given regressor matrix X, one can restrict or choose this class of models by choice of λ.
Given all the considerations above, one might ask whether such a structural model M always satisfies |M| ≤ n under certain conditions. Clearly, uniqueness would be a meaningful requirement in this context, as then all Lasso solutions will choose models of cardinality at most n, as has been shown in Tibshirani (2013). In that case, the Lasso estimator would be equivalent to a low-dimensional Lasso procedure, restricted to this structural model M, and we could employ results from low-dimensional settings also for inference in high-dimensional models, such as Ewald & Schneider (2018) for constructing confidence regions.
In Examples 1 and 2, the Lasso solutions are always unique. It is not difficult, however, to construct an example where the solutions are not unique anymore.
Example 3. Again, take the model from Example 1 with X = (1, 2). This time, choose λ = (1, 2)' (non-uniform tuning). It can easily be seen using Theorem 9 that for each y < −1, all b ∈ R^2 with b_1 ≤ 0, b_2 ≤ 0 and b_1 + 2b_2 = y + 1 are Lasso solutions for the same value of y. Similarly, for y > 1, all b ∈ R^2 with b_1 ≥ 0, b_2 ≥ 0 and b_1 + 2b_2 = y − 1 are Lasso solutions for the same value of y. (Note that β̂_L = 0 for all y with |y| ≤ 1.) The corresponding selection regions are illustrated in Figure 7.
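The non-uniqueness in Example 3 can be checked numerically with a small KKT test. The sketch assumes the objective ‖y − Xβ‖² + 2 Σ_j λ_j |β_j|, for which b is a solution exactly when X_j'(y − Xb) equals sgn(b_j)λ_j for b_j ≠ 0 and is bounded by λ_j in absolute value for b_j = 0; the helper names and the observation y = 3 are illustrative choices, not part of the paper.

```python
def residual(b, X, y):
    """Residual vector y - Xb for X given as a list of rows."""
    return [yi - sum(xij * bj for xij, bj in zip(row, b)) for row, yi in zip(X, y)]


def lasso_objective(b, X, y, lam):
    """Assumed objective ||y - Xb||^2 + 2 * sum_j lam_j |b_j|."""
    r = residual(b, X, y)
    return sum(ri * ri for ri in r) + 2 * sum(l * abs(v) for l, v in zip(lam, b))


def satisfies_kkt(b, X, y, lam, tol=1e-9):
    """Check the KKT conditions of the weighted Lasso at the point b."""
    r = residual(b, X, y)
    for j, bj in enumerate(b):
        g = sum(row[j] * ri for row, ri in zip(X, r))   # X_j'(y - Xb)
        if bj > 0 and abs(g - lam[j]) > tol:
            return False
        if bj < 0 and abs(g + lam[j]) > tol:
            return False
        if bj == 0 and abs(g) > lam[j] + tol:
            return False
    return True


X, y = [[1.0, 2.0]], [3.0]                          # n = 1, p = 2 as in Example 3
candidates = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]   # all satisfy b_1 + 2 b_2 = y - 1

nonuniform = [satisfies_kkt(b, X, y, [1.0, 2.0]) for b in candidates]
values = [lasso_objective(b, X, y, [1.0, 2.0]) for b in candidates]
```

Under λ = (1, 2)' all three candidates pass the check and attain the same objective value, although they correspond to three different models. Under uniform tuning λ = (1, 1)' none of them is a solution, while (0, 1.25)' is, in line with Example 1 where the first component is always zero.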
Example 3 shows an already known property of the Lasso from another perspective: The solution to the Lasso problem is, in general, not unique. Moreover, if the solution is not unique, then, by convexity of the problem, there exists an uncountable set of solutions. The example moreover shows that the set of y which yield non-unique Lasso solutions is not a null set. In fact, in this example, it occurs with probability 2Φ(−1).

Figure 7: The selection regions with respect to X'y from Theorem 12, with X = (1, 2) and λ_1 = 1, λ_2 = 2 from Example 3. Displayed in red is col(X'), the area on which the probability mass of X'y is concentrated. The set T((0, 0)') is displayed in blue, while the parallel light gray lines represent T((0, b_2)') with b_2 ≠ 0, and the parallel dark gray lines are T((b_1, 0)') with b_1 ≠ 0. The yellow lines consist of the singletons T((b_1, b_2)') with b_1, b_2 ≠ 0. The red line passes through T((0, 0)'), where the solution is unique, but also through the line where the light gray, the dark gray and the yellow areas intersect.

Of course, this problem could be overcome by slightly altering the choice of the tuning parameters, even though this would amount to choosing the class of models under consideration, as pointed out previously in this section.

Structural Sets
Clearly, Example 3 shows that the structural set may be equal to the entire set of explanatory variables. It is easy to see that for n = 1 and p = 2, the Lasso estimator will always have a structural set with cardinality n = 1 whenever we have uniqueness. The question is, of course, whether the same can be said in more generality. Before answering this question, we show how the structural set can be determined given X and λ, by counting how many sets of parallel facets of the λ-box B_∅ are intersected by col(X').
Theorem 14. Let X ∈ R^{n×p} and λ ∈ R^p_{≥0} be given. Let M be the so-called structural set of X and λ that contains all j ∈ {1, . . . , p} such that there exist y ∈ R^n so that a corresponding Lasso solution β̂_L satisfies β̂_L,j ≠ 0, that is, M contains all regressors that are part of a Lasso solution for some observation y. This set is given by M = {j : col(X') ∩ B_{j} ≠ ∅}.

Theorem 14 shows that in order to determine the structural set, only the intersection of col(X') with the (p − 1)-dimensional faces, the so-called facets of the λ-cube, has to be considered. A strategy for determining the structural set for a given X might be the following. Note that col(X') = ker(X)^⊥ and find vectors V_1, . . . , V_k ∈ R^p that span ker(X), where k = p − rk(X). Let V = (V_1, . . . , V_k) ∈ R^{p×k} and check whether V's = 0 is solvable for s ∈ B_{j}, where B_{j} = {s ∈ R^p : |s_j| = λ_j, |s_i| ≤ λ_i for i ≠ j}. If this is the case, then j ∈ M, otherwise j ∉ M. So, determining the structural set amounts to identifying a basis of ker(X), and solving a linear system in k = p − rk(X) equations and p unknowns. After that, we have to check whether the resulting solution set contains any elements of B_{j} for j = 1, . . . , p. This approach is employed in Example 5 in the subsequent section.
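For small designs, the membership check j ∈ M can be sketched in code. Instead of computing a kernel basis V and solving V's = 0, the sketch below uses the equivalent parametrization s = X'z with z ∈ R^n and asks whether some z satisfies X_j'z = ±λ_j together with |X_i'z| ≤ λ_i for i ≠ j. Feasibility is decided by Fourier-Motzkin elimination in exact rational arithmetic, which is a swapped-in technique chosen to keep the example self-contained; it is only practical for small n. The function names are hypothetical, the facet description of B_{j} is the one assumed above, and the matrix for Example 2 is read as having rows (2, 0, 1) and (1, 2, 0).

```python
from fractions import Fraction


def feasible(rows, nvars):
    """Decide whether {z : a'z <= c for all (a, c) in rows} is non-empty
    by eliminating one variable at a time (Fourier-Motzkin)."""
    for k in range(nvars):
        pos = [r for r in rows if r[0][k] > 0]
        neg = [r for r in rows if r[0][k] < 0]
        rows = [r for r in rows if r[0][k] == 0]
        for a, c in pos:
            for b, d in neg:
                # positive combination of the two rows that cancels z_k
                coeffs = [(-b[k]) * ai + a[k] * bi for ai, bi in zip(a, b)]
                rows.append((coeffs, (-b[k]) * c + a[k] * d))
    # only constraints of the form 0 <= c remain
    return all(c >= 0 for _, c in rows)


def structural_set(X, lam):
    """Return the 1-based indices j for which col(X') meets the facets B_{j}."""
    n, p = len(X), len(X[0])
    cols = [[Fraction(X[i][j]) for i in range(n)] for j in range(p)]
    lam = [Fraction(l) for l in lam]
    M = set()
    for j in range(p):
        for sign in (1, -1):
            rows = []
            for k in range(p):
                neg_col = [-c for c in cols[k]]
                if k == j:      # equality X_j'z = sign * lam_j as two inequalities
                    rows.append((cols[k], sign * lam[k]))
                    rows.append((neg_col, -sign * lam[k]))
                else:           # |X_k'z| <= lam_k
                    rows.append((cols[k], lam[k]))
                    rows.append((neg_col, lam[k]))
            if feasible(rows, n):
                M.add(j + 1)
                break
    return M


example_1 = structural_set([[1, 2]], [1, 1])                    # uniform tuning
example_3 = structural_set([[1, 2]], [1, 2])                    # non-uniform tuning
example_2 = structural_set([[2, 0, 1], [1, 2, 0]], [1, 1, 1])
```

For Example 1 the sketch returns the second regressor only, for Example 3 both regressors, and for the matrix of Example 2 the first two regressors. Multiplying all weights by the same positive constant leaves the result unchanged.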
We would like to point out the difference between the idea of a structural set and results concerning SAFE rules, such as Ghaoui et al. (2012) and Ndiaye et al. (2017). Based on a SAFE rule, a regressor will be discarded by a Lasso solution for a given observation y. In contrast, if a covariate is not contained in the structural set, it will be excluded from the Lasso model for all observations y. This, on the one hand, implies that the result of Theorem 14 is much cruder than a SAFE rule, excluding fewer regressors (if any). On the other hand, since the structural set is entirely independent of y, the corresponding Lasso problem can equivalently be viewed as a Lasso problem using covariates from M only, also regarding distributional results and inference. In particular, if |M| ≤ n, we can consider the low-dimensional Lasso problem using X_M as regressor matrix. If X_M has full rank, we can then use results from Ewald & Schneider (2018) to construct confidence regions. One has, of course, to be aware that inference is now on the parameter β̃ satisfying Xβ = X_M β̃, as exemplified in Example 2.
Remark 6. As indicated in Theorem 14 and as discussed above, the structural set M depends on X and λ only. Moreover, it can easily be seen that it depends on the tuning parameters λ only through the penalization weighting in the sense that whenever λ = λ̄ω for some λ̄ > 0 and ω ∈ R^p_{≥0}, M(X, λ) = M(X, ω) follows. This implies that, in particular, in the common case of uniform tuning with λ̄ = λ_1 = · · · = λ_p, the structural set only depends on X!

Coming back to the conjecture whether the structural set always satisfies |M| ≤ min{n, p} in case the solutions are unique, using Theorem 14, we can list the following simple example with n = 2 and p = 3 to show that this cannot be the case in general. However, note that Theorem 14 allows us to compute the structural set and that whenever |M| ≤ n, the resulting Lasso estimator is, in fact, just equivalent to a low-dimensional procedure. Finally, it is important to note that Theorem 14 also reveals that M contains all regressors if the columns of X are scaled to have unit length and the components are tuned uniformly: for s = λ̄X'X_j ∈ col(X'), by the Cauchy-Schwarz inequality, we have s ∈ B_{j} also, leading to j ∈ M.
Example 2 (continued). If we rescale the columns of X from Example 2 to unit length, we obtain the matrix X̃ with rows (2/√5, 0, 1) and (1/√5, 1, 0). To view this figure in terms of selection regions, note that the areas corresponding to single-regressor models are displayed in gray, while the selection regions that correspond to two-regressor models are displayed in yellow. The intersection of the λ-cube with col(X'), which corresponds to the zero estimator, is displayed in blue.
The above observation may be seen as an argument for rescaling the regressors before using the Lasso. It may, however, not always be desirable to do so, such as in case the explanatory variables are observed in the same units, or in the presence of dummy variables. Also, rescaling the columns may result in changing whether or not the solutions are unique, an issue addressed in the following section.

A Necessary and Sufficient Condition for Uniqueness
We now turn to some results revolving around uniqueness of the Lasso estimator, which can be obtained with the same geometric approach, that is, studying the intersection of the λ-cube with col(X'). Note that by uniqueness, we mean that for a given X ∈ R^{n×p} and a given λ ∈ R^p, the Lasso solutions are unique for all observations y ∈ R^n. Tibshirani (2013) showed that for a given regressor matrix X, Lasso solutions are unique in the above sense if the columns of X are in general position, which occurs when no k-dimensional affine subspace for k < min(n, p) contains more than k + 1 elements of the set {±X_1, . . . , ±X_p}, excluding antipodal pairs (see p. 1463 in Tibshirani, 2013). In fact, the solutions are then unique for all choices of the tuning parameter, provided that all components are tuned equally. As this condition is sufficient, one may ask whether it is also necessary. The answer to this question is, in fact, no, as can easily be seen from the example below.
When can non-unique solutions exist? For a given $X \in \mathbb{R}^{n \times p}$ and $\lambda \in \mathbb{R}^p$, this occurs if and only if there exist $b, \tilde b \in \mathbb{R}^p$ with $b \neq \tilde b$ that are both Lasso solutions for some $y \in \mathbb{R}^n$. More concretely, by Theorem 12, and since the Lasso fit $Xb$ is always unique, this means that $X'y = X'Xb + v = X'X\tilde b + v$, where $v \in \operatorname{col}(X') \cap B^M$ for some $M \subseteq \{1, \dots, p\}$, and $\tilde b_{M^c} = b_{M^c} = 0$. Moreover, for $j \in M \setminus M_0$, we have $\operatorname{sgn}(b_j) = \operatorname{sgn}(v_j)$ whenever $b_j \neq 0$, as well as $\operatorname{sgn}(\tilde b_j) = \operatorname{sgn}(v_j)$ whenever $\tilde b_j \neq 0$. Note that we therefore have $Xb = X_M b_M = X_M \tilde b_M = X\tilde b$, implying that the columns of $X_M$ must be linearly dependent. So non-uniqueness occurs only if $\operatorname{col}(X') \cap B^M \neq \emptyset$ for some $M \subseteq \{1, \dots, p\}$ with linearly dependent columns in $X_M$. The following example now immediately shows that the columns of $X$ being in general position is not necessary for uniqueness.
Clearly, the columns are not in general position. However, all Lasso solutions are unique for any choice of the tuning parameter when the components are tuned uniformly: We have $\operatorname{col}(X') \cap B^M = \emptyset$ whenever $\{1, 2\} \subseteq M$ or $|M| > 2$. This can easily be checked using the fact that $v \in \operatorname{col}(X')$ if and only if $v'w_1 = v'w_2 = 0$ for $\ker(X) = \operatorname{col}\{w_1, w_2\}$. Therefore, the columns of $X_M$ are linearly independent for any $M$ that can be chosen by the Lasso, and all Lasso solutions must be unique.
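For contrast, the mechanism behind non-uniqueness (two distinct solutions sharing the same fit and the same $v = X'y - X'Xb$) can be made concrete with a toy design that has two identical columns. The sketch below is an illustration only: the objective normalization $\tfrac12\|y - Xb\|^2 + \sum_j \lambda_j |b_j|$, the data, and the helper names `lasso_objective` and `satisfies_kkt` are assumptions, not taken from the text. Under that normalization the optimality conditions read $|v_j| \leq \lambda_j$ with $v_j = \lambda_j \operatorname{sgn}(b_j)$ whenever $b_j \neq 0$.

```python
import numpy as np

def lasso_objective(b, X, y, lam):
    """0.5 * ||y - Xb||^2 + sum_j lam_j |b_j| (normalization assumed here)."""
    r = y - X @ b
    return 0.5 * (r @ r) + np.sum(lam * np.abs(b))

def satisfies_kkt(b, X, y, lam, tol=1e-10):
    """Check the optimality conditions for the objective above."""
    v = X.T @ (y - X @ b)                      # v = X'y - X'Xb lies in col(X')
    inside_cube = np.all(np.abs(v) <= lam + tol)
    active = b != 0
    signs_match = np.allclose(v[active], lam[active] * np.sign(b[active]), atol=tol)
    return bool(inside_cube and signs_match)

# Toy design with two identical (hence linearly dependent) columns.
X = np.array([[1.0, 1.0],
              [0.0, 0.0]])
y = np.array([3.0, 0.0])
lam = np.array([1.0, 1.0])

b = np.array([2.0, 0.0])
b_tilde = np.array([1.0, 1.0])

# Both candidates give the same fit and the same v, and both satisfy the
# optimality conditions, so the minimizer is not unique for this y.
assert np.allclose(X @ b, X @ b_tilde)
assert np.allclose(X.T @ (y - X @ b), X.T @ (y - X @ b_tilde))
assert satisfies_kkt(b, X, y, lam) and satisfies_kkt(b_tilde, X, y, lam)
assert np.isclose(lasso_objective(b, X, y, lam), lasso_objective(b_tilde, X, y, lam))
```

Here $v = (1, 1)'$ touches the boundary of the $\lambda$-cube in both coordinates while the two corresponding columns of $X$ are linearly dependent, which is the situation singled out in the discussion above.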
The example above illustrates the commonly known fact that, if a Lasso solution is unique, it will contain at most n non-zero entries. We show that this fact can be sharpened to yield a necessary and sufficient condition for uniqueness of all Lasso solutions in the following way: first, we show that if the solution is unique, it in fact has at most rk(X) ≤ n non-zero components. Second, we prove that this is not only a necessary, but also a sufficient criterion for uniqueness.
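Theorem 15 (Uniqueness). Let $X \in \mathbb{R}^{n \times p}$ and $\lambda \in \mathbb{R}^p_{\geq 0}$. The Lasso solution is unique for all $y \in \mathbb{R}^n$ if and only if $\operatorname{col}(X') \cap B^M = \emptyset$ for all $M \subseteq \{1, \dots, p\}$ with $|M| > \operatorname{rk}(X)$.
Proof. ($\Longrightarrow$) Assume the condition is not satisfied. Then there exists $v \in B^M$ with $|M| > \operatorname{rk}(X)$ and $v = X'z$ for some $z \in \mathbb{R}^n$. We show that there is a $y \in \mathbb{R}^n$ such that the corresponding Lasso problem is not uniquely solvable.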
If $X_j = 0$ for some $j \in M_0$, we are done as the corresponding coefficient may be arbitrary. Note that $X_j = 0$ for $j \in M \setminus M_0$ is not possible: since $v \in \operatorname{col}(X')$, this would imply $v_j = 0$, but that contradicts $v \in B^M$. We therefore assume that $X_j \neq 0$ for all $j \in M$.
Since $|M| > \operatorname{rk}(X)$, there is a column of $X_M$, say $X_j$ ($X_j \neq 0$), that can be written as a linear combination of the other columns. In particular, we can write Then $b$ is a Lasso solution for $y = z + Xb$ since $X'y = X'Xb + X'z = X'Xb + v \in T(b)$.
We now construct $\tilde b \in \mathbb{R}^p$, with $\tilde b \neq b$, that is also a Lasso solution for the same $y$ by We therefore get $X'y = X'Xb + v = X'X\tilde b + v \in S(\tilde b)$ also, implying that both $b$ and $\tilde b$ are Lasso solutions for the given $y$.
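($\Longleftarrow$) We now prove the other direction. Assume that there exists $y \in \mathbb{R}^n$ such that non-unique Lasso solutions $b \neq \tilde b$ exist. As discussed above, this implies the existence of $v \in \operatorname{col}(X') \cap B^M$ for some $M \subseteq \{1, \dots, p\}$ with $X_M b_M = X_M \tilde b_M$ and $b_{M^c} = \tilde b_{M^c} = 0$, entailing that the columns of $X_M$ are linearly dependent. If $|M| > \operatorname{rk}(X)$, we are done. If $|M| \leq \operatorname{rk}(X)$, we do the following. Since we have $\operatorname{rk}(X_M) < |M| \leq \operatorname{rk}(X)$, we can pick $z \in \mathbb{R}^n$ such that $z \in \operatorname{col}(X_M)^\perp \setminus \operatorname{col}(X_{M^c})^\perp$. This is possible since $\operatorname{col}(X_M)^\perp \subseteq \operatorname{col}(X_{M^c})^\perp \iff \operatorname{rk}(X_M) = \operatorname{rk}(X)$, which is not the case. This $z$ satisfies $(X'z)_M = (X_M)'z = 0$ and $(X'z)_{M^c} = (X_{M^c})'z \neq 0$, so that we can find $c \in \mathbb{R}$ such that $\tilde v = v + cX'z \in B^{\tilde M} \cap \operatorname{col}(X')$, with $M \subseteq \tilde M$ and $|M| < |\tilde M|$. As long as $|\tilde M| \leq \operatorname{rk}(X)$, repeat the steps above with $v = \tilde v$ and $M = \tilde M$.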
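A single enlargement step from the second half of the proof can be sketched in code. This is an illustration under stated assumptions, not the authors' implementation: it assumes that $B^M$ collects the points $v$ of the $\lambda$-cube with $|v_j| = \lambda_j$ for $j \in M$, it picks $z$ as the projection of a column outside $M$ onto $\operatorname{col}(X_M)^\perp$, and the function name `enlarge_face` as well as the toy matrix are hypothetical.

```python
import numpy as np

def enlarge_face(X, lam, v, tol=1e-10):
    """Move v along X'z inside the lambda-cube until a new coordinate is tight.

    Returns (v_new, M_old, M_new), with M_* boolean masks of tight coordinates.
    Hypothetical helper; assumes v is in col(X') and inside the lambda-cube.
    """
    M = np.abs(np.abs(v) - lam) <= tol             # coordinates with |v_j| = lam_j
    XM = X[:, M]
    # Orthogonal projector onto col(X_M)^perp.
    P = np.eye(X.shape[0]) - XM @ np.linalg.pinv(XM)
    # z = P X_k is orthogonal to col(X_M); it is nonzero iff X_k is not in col(X_M).
    for k in np.flatnonzero(~M):
        z = P @ X[:, k]
        if np.linalg.norm(z) > tol:
            break
    else:
        raise ValueError("rk(X_M) = rk(X): no admissible z exists")
    d = X.T @ z                                    # direction in col(X'), zero on M
    free = (~M) & (np.abs(d) > tol)
    # Smallest positive step at which some free coordinate reaches +-lam_j.
    c = np.min((lam[free] - np.sign(d[free]) * v[free]) / np.abs(d[free]))
    v_new = v + c * d
    return v_new, M, np.abs(np.abs(v_new) - lam) <= tol

# Toy input (assumed for illustration): n = 2, p = 3, uniform lambda.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
lam = np.ones(3)
v = X.T @ np.array([1.0, -0.5])                    # v = (1, -0.5, 0.5) in col(X')

v_new, M_old, M_new = enlarge_face(X, lam, v)
assert M_new.sum() > M_old.sum()                   # one more tight coordinate
assert np.all(np.abs(v_new) <= lam + 1e-9)         # still inside the cube
```

The choice $z = P X_k$ is only one convenient way to obtain an element of $\operatorname{col}(X_M)^\perp$ that is not orthogonal to every column outside $M$; any other such $z$ would serve the same purpose in the argument.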
Note that just as for Theorem 14, the result from the above theorem depends on $\lambda$ only through the penalization weights, meaning that for any $M \subseteq \{1, \dots, p\}$, whenever $\lambda = \tilde\lambda\omega$ for some $\tilde\lambda > 0$ and $\omega \in \mathbb{R}^p_{\geq 0}$, we have $\operatorname{col}(X') \cap B^M(\lambda) = \emptyset$ if and only if $\operatorname{col}(X') \cap B^M(\omega) = \emptyset$ (when indicating the dependence of $B^M$ on the tuning parameters).
As mentioned in the preamble of Section 4, Theorem 15 does not require p > n, so that it also covers the low-dimensional case. Clearly, the condition for uniqueness is trivially satisfied if rk(X) = p.

Conclusion
We give explicit formulae regarding the distribution of the Lasso estimator in finite samples, assuming a Gaussian distribution of $X'\varepsilon$. In the low-dimensional case, we consider the cdf as well as the density functions conditional on "active sets" of the estimator. Our results exploit the structure of the underlying optimization problem of the Lasso estimator and do not hinge on the normality assumption. We also explicitly characterize the correspondence between the Lasso and the LS estimator: It is shown that the Lasso estimator essentially creates shrinkage areas around the axes inside which the probability mass of the LS estimator is compressed into lower-dimensional densities that can be specified conditional on the active set of the estimator. As a result, the distribution looks like a pieced-together combination of Gaussian-like densities. Each active set has its own distributional piece with dimension depending on the number of nonzero components, resulting also in point mass at the origin and mass being distributed along the axes.
The form of the distribution is even more intricate in the high-dimensional case, in which the estimator may no longer be unique. We quantify the relationship between a Lasso solution and the quantity $X'y$ (rather than the LS estimator as in the low-dimensional case). We gain valuable insights into the behavior of the estimator by illustrating that some models may never be selected by the estimator: The so-called structural set, which contains all covariates that are part of a Lasso solution for some response vector $y$, can be computed based on a geometric condition involving the regressor matrix and penalization weights only. In case this structural set has cardinality less than or equal to $n$, the Lasso is equivalent to a low-dimensional procedure and results from the $p \leq n$ framework can be used for inference. We also learn that in case of uniform tuning and the columns of $X$ scaled to unit length, the structural set contains all covariates.
Finally, the previous insights allow us to close a gap in the literature by providing a condition for uniqueness of the Lasso estimator that is both necessary and sufficient.