The Generalized Lasso Problem and Uniqueness

We study uniqueness in the generalized lasso problem, where the penalty is the $\ell_1$ norm of a matrix $D$ times the coefficient vector. We derive a broad result on uniqueness that places weak assumptions on the predictor matrix $X$ and penalty matrix $D$; the implication is that, if $D$ is fixed and its null space is not too large (the dimension of its null space is at most the number of samples), and $X$ and response vector $y$ jointly follow an absolutely continuous distribution, then the generalized lasso problem has a unique solution almost surely, regardless of the number of predictors relative to the number of samples. This effectively generalizes previous uniqueness results for the lasso problem (which corresponds to the special case $D=I$). Further, we extend our study to the case in which the loss is given by the negative log-likelihood from a generalized linear model. In addition to uniqueness results, we derive results on the local stability of generalized lasso solutions that might be of interest in their own right.


Introduction
We consider the generalized lasso problem
$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|D\beta\|_1, \tag{1}$$
where $y \in \mathbb{R}^n$ is a response vector, $X \in \mathbb{R}^{n \times p}$ is a predictor matrix, $D \in \mathbb{R}^{m \times p}$ is a penalty matrix, and $\lambda \geq 0$ is a tuning parameter. As explained in Tibshirani and Taylor (2011), the generalized lasso problem (1) encompasses several well-studied problems as special cases, corresponding to different choices of $D$, e.g., the lasso (Tibshirani, 1996), the fused lasso (Rudin et al., 1992; Tibshirani et al., 2005), trend filtering (Steidl et al., 2006; Kim et al., 2009), the graph fused lasso (Hoefling, 2010), graph trend filtering (Wang et al., 2016), and Kronecker trend filtering, among others. (For all problems except the lasso problem, the literature is mainly focused on the so-called "signal approximator" case, where $X = I$, and the responses have a certain underlying structure; but the "regression" case, where $X$ is arbitrary, naturally arises whenever the predictor variables, rather than the responses, have an analogous structure.) There has been an abundance of theoretical and computational work on the generalized lasso and its various special cases. In this work, we examine sufficient conditions under which the solution in (1) will be unique. While this is simple enough to state, it is a problem of fundamental importance, and our study of uniqueness leads us to develop intermediate properties of generalized lasso solutions that may be of interest in their own right. In particular, when we broaden our focus to a version of (1) where the squared loss is replaced by a general loss function, we derive local stability properties of solutions.
When p ≤ n and rank(X) = p, there is of course a unique solution in (1) due to strict convexity of the squared loss term. Our focus will hence be in determining sufficient conditions for uniqueness in the high-dimensional case, where rank(X) < p. In the lasso problem, defined by taking D = I in (1), several authors have studied conditions for uniqueness, notably Tibshirani (2013), who showed that when the entries of X are drawn from an arbitrary continuous distribution, the lasso solution is unique almost surely. One of the main results in this paper, on uniqueness in problem (1) for a general D, yields this lasso result as a special case; see Theorem 1, and Remark 6 following the theorem.
It is worth noting that when $\mathrm{null}(X) \cap \mathrm{null}(D) \neq \{0\}$, problem (1) cannot have a unique solution. This is because if $\eta \neq 0$ lies in this intersection of null spaces, and $\hat\beta$ is a solution in (1), then so is $\hat\beta + \eta$. Therefore, at the very least, any sufficient condition for uniqueness in (1) must include (or imply) the null space condition $\mathrm{null}(X) \cap \mathrm{null}(D) = \{0\}$.
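The argument above can be spelled out as a short worked equation. This is only a sketch: the symbol $f$ for the criterion in (1) is introduced here for illustration, and $\eta$ is any nonzero vector with $X\eta = 0$ and $D\eta = 0$.

```latex
\begin{align*}
f(\hat\beta + \eta)
&= \tfrac{1}{2}\|y - X(\hat\beta + \eta)\|_2^2 + \lambda \|D(\hat\beta + \eta)\|_1 \\
&= \tfrac{1}{2}\|y - X\hat\beta - X\eta\|_2^2 + \lambda \|D\hat\beta + D\eta\|_1 \\
&= \tfrac{1}{2}\|y - X\hat\beta\|_2^2 + \lambda \|D\hat\beta\|_1
 = f(\hat\beta).
\end{align*}
```

Since $\hat\beta + \eta$ attains the optimal criterion value and differs from $\hat\beta$, it is a second solution.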
In the remainder of this introduction, we describe the implications of our uniqueness results for various special cases of the generalized lasso, discuss related work, and then cover notation and an outline of the rest of the paper.

Uniqueness in special cases
The following is an application of Theorem 1 to various special cases for the penalty matrix $D$. The takeaway is that, for continuously distributed predictors and responses, uniqueness can be guaranteed almost surely in various interesting cases of the generalized lasso provided that $n$ is not "too small", meaning that the sample size $n$ is not less than the nullity (dimension of the null space) of $D$. (Note that some of the cases presented in the corollary can be folded into others, but we enumerate each of them for clarity.)
Corollary 1. Fix any $\lambda > 0$. Assume the joint distribution of $(X, y)$ is absolutely continuous with respect to $(np + n)$-dimensional Lebesgue measure. Then problem (1) admits a unique solution almost surely, in any one of the following cases:
(i) $D \in \mathbb{R}^{(p-1) \times p}$ is the first difference matrix, i.e., fused lasso penalty matrix (see Section 2.1.1 in Tibshirani and Taylor 2011);
(ii) $D \in \mathbb{R}^{(p-k-1) \times p}$ is the $(k+1)$st order difference matrix, i.e., $k$th order trend filtering penalty matrix (see Section 2.1.2 in Tibshirani and Taylor 2011), and $n \geq k + 1$;
(iii) $D \in \mathbb{R}^{m \times p}$ is the graph fused lasso penalty matrix, defined over a graph with $m$ edges, $n$ nodes, and $r$ connected components (see Section 2.1.1 in Tibshirani and Taylor 2011), and $n \geq r$;
(iv) $D \in \mathbb{R}^{m \times p}$ is the $k$th order graph trend filtering penalty matrix, defined over a graph with $m$ edges, $n$ nodes, and $r$ connected components (see Wang et al. 2016), and $n \geq r$;
(v) $D \in \mathbb{R}^{(N-k-1)N^{d-1}d \times N^d}$ is the $k$th order Kronecker trend filtering penalty matrix, defined over a $d$-dimensional grid graph with all equal side lengths $N = n^{1/d}$, and $n \geq (k+1)^d$.
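To make the role of the nullity concrete, here is a minimal NumPy sketch that builds difference matrices as in cases (i) and (ii) and computes the dimension of their null spaces. The helper names `difference_matrix` and `nullity` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def difference_matrix(p, order):
    """Return the order-th discrete difference matrix of shape (p - order, p)."""
    D = np.eye(p)
    for _ in range(order):
        D = np.diff(D, axis=0)  # each pass takes first differences of the rows
    return D

def nullity(D):
    """Dimension of null(D), computed as (number of columns) - rank(D)."""
    return D.shape[1] - int(np.linalg.matrix_rank(D))

p = 12
fused = difference_matrix(p, 1)   # first difference matrix, as in case (i)
trend = difference_matrix(p, 3)   # (k+1)st order difference with k = 2, case (ii)
print(fused.shape, nullity(fused))
print(trend.shape, nullity(trend))
```

In this sketch the nullity of the $(k+1)$st order difference matrix comes out as $k + 1$, which matches the lower bound on $n$ stated in case (ii).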
Two interesting special cases of the generalized lasso that fall outside the scope of our results here are additive trend filtering and varying-coefficient models (which can be cast in a generalized lasso form, see Section 2.2 of Tibshirani and Taylor 2011). In either of these problems, the predictor matrix $X$ has random elements but obeys a particular structure, and it is not reasonable to assume that its entries overall follow a continuous distribution; thus Theorem 1 cannot be immediately applied. Still, we believe that under weak conditions either problem should have a unique solution. A uniqueness result for additive trend filtering has been given by reducing this problem to lasso form; but keeping this problem in generalized lasso form and carefully investigating an application of Lemma 6 (the deterministic result in this paper leading to Theorem 1) may yield a result with simpler sufficient conditions. This is left to future work.
Furthermore, by applying Theorem 2 to various special cases for $D$, analogous results hold (for all cases in Corollary 1) when the squared loss is replaced by a generalized linear model loss $G$ as in (19). In this setting, the assumption that $(X, y)$ is jointly absolutely continuous is replaced by the two assumptions that $X$ is absolutely continuous, and $y \notin \mathcal{N}$, where $\mathcal{N}$ is the set defined in (37). The set $\mathcal{N}$ has Lebesgue measure zero for some common choices of loss $G$ (see Remark 11); but unless we somewhat artificially assume that the distribution of $y|X$ is continuous (this is artificial because in the two most fundamental generalized linear models outside of the Gaussian model, namely the Bernoulli and Poisson models, the entries of $y|X$ are discrete), the fact that $\mathcal{N}$ has Lebesgue measure zero does not directly imply that the condition $y \notin \mathcal{N}$ holds almost surely. Still, it seems that $y \notin \mathcal{N}$ should be "likely" (and hence, uniqueness should be "likely") in a typical generalized linear model setup, and making this precise is left to future work.

Related work
Several authors have examined uniqueness of solutions in statistical optimization problems en route to proving risk or recovery properties of these solutions; see the references in Tibshirani (2013) for examples of such results in the lasso problem, and Lee et al. (2015) for an example in the generalized lasso problem. These results have a different aim than ours: their main goal (a risk or recovery guarantee) is more ambitious than certifying uniqueness alone, and thus the conditions they require are more stringent.
For uniqueness results on the lasso, see Tibshirani (2013) and references therein. For uniqueness results on the noiseless lasso problem (and the analogous noiseless $\ell_0$-penalized problem), see Donoho (2006) and Dossal (2012). For uniqueness results on the noiseless generalized lasso problem, see Nam et al. (2013).

Notation and outline
In terms of notation, for a matrix $A \in \mathbb{R}^{m \times n}$, we write $A^+$ for its Moore-Penrose pseudoinverse and $\mathrm{col}(A)$, $\mathrm{row}(A)$, $\mathrm{null}(A)$, $\mathrm{rank}(A)$ for its column space, row space, null space, and rank, respectively. We write $A_J$ for the submatrix defined by the rows of $A$ indexed by a subset $J \subseteq \{1, \ldots, m\}$, and use $A_{-J}$ as shorthand for $A_{\{1,\ldots,m\} \setminus J}$. Similarly, for a vector $x \in \mathbb{R}^m$, we write $x_J$ for the subvector defined by the components of $x$ indexed by $J$, and use $x_{-J}$ as shorthand for $x_{\{1,\ldots,m\} \setminus J}$.
For a set $S \subseteq \mathbb{R}^n$, we write $\mathrm{span}(S)$ for its linear span, and write $\mathrm{aff}(S)$ for its affine span. For a subspace $L \subseteq \mathbb{R}^n$, we write $P_L$ for the (Euclidean) projection operator onto $L$, and write $P_{L^\perp}$ for the projection operator onto the orthogonal complement $L^\perp$. For a function $f : \mathbb{R}^m \to \mathbb{R}^n$, we write $\mathrm{dom}(f)$ for its domain, and $\mathrm{ran}(f)$ for its range.
Here is an outline for what follows. In Section 2, we review important preliminary facts about the generalized lasso. In Section 3, we derive sufficient conditions for uniqueness in (1), culminating in Theorem 1, our main result on uniqueness in the squared loss case. In Section 4, we consider a generalization of problem (1) where the squared loss is replaced by a smooth and strictly convex function of Xβ; we derive analogs of the important preliminary facts used in the squared loss case, notably, we generalize a result on the local stability of generalized lasso solutions due to Tibshirani and Taylor (2012); and we give sufficient conditions for uniqueness, culminating in Theorem 2, our main result in the general loss case. In Section 5, we conclude with a brief discussion.
(ii) Every solution $\hat\beta$ gives rise to the same fitted value $X\hat\beta$.
Proof. The criterion function in the generalized lasso problem (1) is convex and proper, as well as closed (being continuous on $\mathbb{R}^p$). As both $g(\beta) = \|y - X\beta\|_2^2$ and $h(\beta) = \lambda \|D\beta\|_1$ are nonnegative, any directions of recession of the criterion $f = g + h$ are necessarily directions of recession of both $g$ and $h$. Hence, we see that all directions of recession of the criterion $f$ must lie in the common null space $\mathrm{null}(X) \cap \mathrm{null}(D)$; but these are directions in which the criterion is constant. Applying, e.g., Theorem 27.1 in Rockafellar (1970) tells us that the criterion attains its infimum, so there is at least one solution in problem (1). Supposing there are two solutions $\hat\beta^{(1)}, \hat\beta^{(2)}$, since the solution set to a convex optimization problem is itself a convex set, we get that $t\hat\beta^{(1)} + (1 - t)\hat\beta^{(2)}$ is also a solution, for any $t \in [0, 1]$. Thus if there is more than one solution, then there are uncountably many solutions. This proves part (i).
Lastly, for part (iii), every solution in the generalized lasso problem (1) yields the same fit by part (ii), leading to the same squared loss; and since every solution also obtains the same (optimal) criterion value, we conclude that every solution obtains the same penalty value, provided that λ > 0.
Next, we consider the Karush-Kuhn-Tucker (or KKT) conditions to characterize optimality of a solution $\hat\beta$ in problem (1). Since there are no constraints, we simply take a subgradient of the criterion and set it equal to zero. Rearranging gives
$$X^T(y - X\hat\beta) = \lambda D^T \hat\gamma, \tag{2}$$
where $\hat\gamma \in \mathbb{R}^m$ is a subgradient of the $\ell_1$ norm evaluated at $D\hat\beta$,
$$\hat\gamma_i \in \begin{cases} \{\mathrm{sign}((D\hat\beta)_i)\} & \text{if } (D\hat\beta)_i \neq 0, \\ [-1, 1] & \text{if } (D\hat\beta)_i = 0, \end{cases} \quad \text{for } i = 1, \ldots, m. \tag{3}$$
Since the optimal fit $X\hat\beta$ is unique by Lemma 1, the left-hand side in (2) is always unique. This immediately leads to the next result.
Lemma 2. For any $y$, $X$, $D$, and $\lambda > 0$, every optimal subgradient $\hat\gamma$ in problem (1) gives rise to the same value of $D^T\hat\gamma$. Moreover, when $D$ has full row rank, the optimal subgradient $\hat\gamma$ is itself unique.
Remark 1. When $D$ is row rank deficient, the optimal subgradient $\hat\gamma$ is not necessarily unique, and thus neither is its associated boundary set (to be defined in the next subsection). This complicates the study of uniqueness of the generalized lasso solution. In contrast, the optimal subgradient in the lasso problem is always unique, and its boundary set (called the equicorrelation set in this case) is too, which makes the study of uniqueness of the lasso solution comparatively simpler (Tibshirani, 2013).
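The KKT conditions can be checked numerically in a toy case. The following is a minimal sketch, assuming the stationarity condition takes the standard form $X^T(y - X\hat\beta) = \lambda D^T\hat\gamma$ with $\hat\gamma$ a subgradient of the $\ell_1$ norm at $D\hat\beta$, and using the special case $X = I$, $D = I$, where the solution is given by soft-thresholding. The function names `soft_threshold` and `check_kkt` are hypothetical.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5*||y - b||^2 + lam*||b||_1 (X = I, D = I)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def check_kkt(X, D, y, lam, beta, gamma, tol=1e-10):
    """Check stationarity and the subgradient condition for a candidate pair."""
    stationarity = np.allclose(X.T @ (y - X @ beta), lam * (D.T @ gamma), atol=tol)
    d = D @ beta
    active = np.abs(d) > tol
    # On the active set, gamma must equal the sign of D beta.
    signs_ok = np.allclose(gamma[active], np.sign(d[active]), atol=tol)
    # Everywhere, gamma must lie in [-1, 1].
    bounded = np.all(np.abs(gamma) <= 1 + tol)
    return bool(stationarity and signs_ok and bounded)

y = np.array([3.0, 0.5, -2.0])
lam = 1.0
X = np.eye(3)
D = np.eye(3)
beta_hat = soft_threshold(y, lam)
gamma_hat = (y - beta_hat) / lam   # subgradient implied by stationarity
print(beta_hat, gamma_hat, check_kkt(X, D, y, lam, beta_hat, gamma_hat))
```

Here the second coordinate has $(D\hat\beta)_i = 0$ and a subgradient strictly inside $[-1, 1]$, while the other two coordinates lie in the boundary set.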
Lastly, we turn to the dual of problem (1). Standard arguments in convex analysis, as given in Tibshirani and Taylor (2011), show that the Lagrangian dual of (1) can be written as
$$\underset{u \in \mathbb{R}^m,\, v \in \mathbb{R}^n}{\text{minimize}} \;\; \|y - v\|_2^2 \quad \text{subject to} \quad X^T v = D^T u, \;\; \|u\|_\infty \leq \lambda. \tag{4}$$
Any pair $(\hat u, \hat v)$ optimal in the dual (4), and solution-subgradient pair $(\hat\beta, \hat\gamma)$ optimal in the primal (1), i.e., satisfying (2), (3), must satisfy the primal-dual relationships
$$X\hat\beta = y - \hat v, \quad \hat u = \lambda \hat\gamma. \tag{5}$$
We see that $\hat v$, being a function of the fit $X\hat\beta$, is always unique; meanwhile, $\hat u$, being a function of the optimal subgradient $\hat\gamma$, is not. Moreover, the optimality of $\hat v$ in problem (4) can be expressed as
$$\hat v = P_C(y), \quad \text{where} \quad C = (X^T)^{-1}\big(D^T B^m_\infty(\lambda)\big). \tag{6}$$
Here, $(X^T)^{-1}(S)$ denotes the preimage of a set $S$ under the linear map $X^T$, $D^T S$ denotes the image of a set $S$ under the linear map $D^T$, $B^m_\infty(\lambda) = \{u \in \mathbb{R}^m : \|u\|_\infty \leq \lambda\}$ is the $\ell_\infty$ ball of radius $\lambda$ in $\mathbb{R}^m$, and $P_S(\cdot)$ is the Euclidean projection operator onto a set $S$. Note that $C$ as defined in (6) is a convex polyhedron, because the image or preimage of any convex polyhedron under a linear map is a convex polyhedron. From (5) and (6), we may hence write the fit as
$$X\hat\beta = (I - P_C)(y), \tag{7}$$
the residual from projecting $y$ onto the convex polyhedron $C$.
The conclusion in (7), it turns out, could have been reached via direct manipulation of the KKT conditions (2), (3), as shown in Tibshirani and Taylor (2012). In fact, much of what can be seen from the dual problem (4) can also be derived using appropriate manipulations of the primal problem (1) and its KKT conditions (2), (3). However, we feel that the dual perspective, specifically the dual projection in (6), offers a simple picture that can be used to intuitively explain several key results (which might otherwise seem technical and complicated in nature). We will therefore return to it periodically.
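The dual picture is easy to illustrate in the simplest special case. The following sketch assumes $X = I$ and $D = I$, in which case the set $C$ reduces to the $\ell_\infty$ ball of radius $\lambda$, projection onto $C$ is coordinatewise clipping, and the residual from this projection should coincide with the soft-thresholded response. The function name `project_onto_box` is hypothetical.

```python
import numpy as np

def project_onto_box(y, lam):
    """Euclidean projection onto the l-infinity ball of radius lam."""
    return np.clip(y, -lam, lam)

y = np.array([3.0, 0.5, -2.0])
lam = 1.0

v_hat = project_onto_box(y, lam)   # dual solution when X = I and D = I
fit = y - v_hat                    # primal fit as the residual from projection

# Direct primal solution in this special case, for comparison.
soft = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
print(v_hat, fit, np.allclose(fit, soft))
```

For general $X$ and $D$ the polyhedron $C$ has no such simple form, so this is only meant to convey the geometry, not a general-purpose solver.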

Implicit form of solutions
Fix an arbitrary $\lambda > 0$, and let $(\hat\beta, \hat\gamma)$ denote an optimal solution-subgradient pair, i.e., satisfying (2), (3). Following Tibshirani and Taylor (2011, 2012), we define the boundary set to contain the indices of components of $\hat\gamma$ that achieve the maximum possible absolute value,
$$B = \big\{ i \in \{1, \ldots, m\} : |\hat\gamma_i| = 1 \big\},$$
and the boundary signs to be the signs of $\hat\gamma$ over the boundary set,
$$s = \mathrm{sign}(\hat\gamma_B).$$
Since $\hat\gamma$ is not necessarily unique, as discussed in the previous subsection, neither are its associated boundary set and signs $B, s$. Note that the boundary set contains the active set
$$A = \mathrm{supp}(D\hat\beta) = \big\{ i \in \{1, \ldots, m\} : (D\hat\beta)_i \neq 0 \big\}$$
associated with $\hat\beta$; that $B \supseteq A$ follows directly from the property (3) (and strict inclusion is certainly possible). Restated, this inclusion tells us that $\hat\beta$ must lie in the null space of $D_{-B}$, i.e.,
$$D_{-B}\hat\beta = 0.$$
Though it seems very simple, the last display provides an avenue for expressing the generalized lasso fit and solutions in terms of $B, s$, which will be quite useful for establishing sufficient conditions for uniqueness of the solution. Multiplying both sides of the stationarity condition (2) by $P_{\mathrm{null}(D_{-B})}$, the projection matrix onto $\mathrm{null}(D_{-B})$, we have
$$P_{\mathrm{null}(D_{-B})} X^T (y - X\hat\beta) = \lambda P_{\mathrm{null}(D_{-B})} D_B^T s.$$
Using $\hat\beta = P_{\mathrm{null}(D_{-B})}\hat\beta$, and solving for the fit $X\hat\beta$ (see Tibshirani and Taylor, 2012 for details or the proof of Lemma 15 for the arguments in a more general case) gives
$$X\hat\beta = X P_{\mathrm{null}(D_{-B})} \big(X P_{\mathrm{null}(D_{-B})}\big)^+ \Big( y - \lambda \big(P_{\mathrm{null}(D_{-B})} X^T\big)^+ D_B^T s \Big). \tag{8}$$
Recalling that $X\hat\beta$ is unique from Lemma 1, we see that the right-hand side in (8) must agree for all instantiations of the boundary set and signs $B, s$ associated with an optimal subgradient in problem (1). Tibshirani and Taylor (2012) use this observation and other arguments to establish an important result that we leverage later, on the invariance of the space $X\mathrm{null}(D_{-B}) = \mathrm{col}(X P_{\mathrm{null}(D_{-B})})$ over all boundary sets $B$ of optimal subgradients, stated in Lemma 3 for completeness.
Figure 1: Geometry of the generalized lasso dual problem (4). As in (6), the dual solution $\hat v$ may be seen as the projection of $y$ onto a set $C$, and as in (7), the primal fit $X\hat\beta$ may be seen as the residual from this projection. Here, $C = (X^T)^{-1}(D^T B^m_\infty(\lambda))$, and as $B^m_\infty(\lambda)$ is a polyhedron (and the image or inverse image of a polyhedron under a linear map is still a polyhedron), $C$ is a polyhedron as well. This can be used to derive the implicit form (8) for $X\hat\beta$, based on the face of $C$ on which $\hat v$ lies, as explained in Remark 2.
Remark 2. As an alternative to the derivation based on the KKT conditions described above, the result (8) can be argued directly from the geometry surrounding the dual problem (4). See Figure 1 for an accompanying illustration. Given that $\hat\gamma$ has boundary set and signs $B, s$, and $\hat u = \lambda\hat\gamma$ from (5), we see that $\hat u$ must lie on the face of $B^m_\infty(\lambda)$ whose affine span is $E_{B,s} = \{u \in \mathbb{R}^m : u_B = \lambda s\}$; this face is colored in black on the right-hand side of the figure. Since $X^T\hat v = D^T\hat u$, this means that $\hat v$ lies on the face of $C$ whose affine span is $K_{B,s} = (X^T)^{-1} D^T E_{B,s}$; this face is colored in black on the left-hand side of the figure, and its affine span $K_{B,s}$ is drawn as a dotted line. Hence, we may refine our view of $\hat v$ in (6), and in turn, $X\hat\beta$ in (7): namely, we may view $\hat v$ as the projection of $y$ onto the affine space $K_{B,s}$ (instead of $C$), and the fit $X\hat\beta$ as the residual from this affine projection. A straightforward calculation shows that
$$K_{B,s} = \lambda \big(P_{\mathrm{null}(D_{-B})} X^T\big)^+ D_B^T s + \mathrm{null}\big(P_{\mathrm{null}(D_{-B})} X^T\big),$$
and another straightforward calculation shows that the residual from projecting $y$ onto $K_{B,s}$ is (8).
From the expression in (8) for the fit $X\hat\beta$, we also see that the solution $\hat\beta$ corresponding to the optimal subgradient $\hat\gamma$ and its boundary set and signs $B, s$ must take the form
$$\hat\beta = \big(X P_{\mathrm{null}(D_{-B})}\big)^+ \Big( y - \lambda \big(P_{\mathrm{null}(D_{-B})} X^T\big)^+ D_B^T s \Big) + b, \tag{9}$$
for some $b \in \mathrm{null}(X P_{\mathrm{null}(D_{-B})})$. Combining this with $b \in \mathrm{null}(D_{-B})$ (following from $D_{-B}\hat\beta = 0$), we moreover have that $b \in \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$. In fact, any such point $b \in \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$ yields a generalized lasso solution $\hat\beta$ in (9) provided that
$$s_i \cdot (D\hat\beta)_i \geq 0, \quad \text{for all } i \in B,$$
which says that $\hat\gamma$ appropriately matches the signs of the nonzero components of $D\hat\beta$, thus $\hat\gamma$ remains a proper subgradient.
We can now begin to inspect conditions for uniqueness of the generalized lasso solution. For a given boundary set $B$ of an optimal subgradient $\hat\gamma$, if we know that $\mathrm{null}(X) \cap \mathrm{null}(D_{-B}) = \{0\}$, then there can only be one solution $\hat\beta$ corresponding to $\hat\gamma$ (i.e., such that $(\hat\beta, \hat\gamma)$ jointly satisfy (2), (3)), and it is given by the expression in (9) with $b = 0$. Further, if we know that $\mathrm{null}(X) \cap \mathrm{null}(D_{-B}) = \{0\}$ for all boundary sets $B$ of optimal subgradients, and the space $\mathrm{null}(D_{-B})$ is invariant over all choices of boundary sets $B$ of optimal subgradients, then the right-hand side in (9) with $b = 0$ must agree for all proper instantiations of $B, s$ and it gives the unique generalized lasso solution. We elaborate on this in the next section.
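As a rough numerical illustration of the implicit form with $b = 0$, the sketch below assumes the solution can be written as $\hat\beta = (XP)^+\big(y - \lambda (PX^T)^+ D_B^T s\big)$, where $P$ is the projector onto $\mathrm{null}(D_{-B})$, and that the boundary set and signs are already known. The function names are hypothetical, and the example is a three-point fused lasso with $X = I$ whose solution $(0.5, 0.5, 2)$ was worked out by hand from the KKT conditions.

```python
import numpy as np

def null_space_projector(A, tol=1e-10):
    """Orthogonal projector onto null(A), built from the SVD of A."""
    p = A.shape[1]
    if A.shape[0] == 0:
        return np.eye(p)                 # null space of an empty matrix is everything
    _, sing, Vt = np.linalg.svd(A)       # full_matrices=True, so Vt is p x p
    rank = int(np.sum(sing > tol))
    N = Vt[rank:].T                      # orthonormal basis of null(A)
    return N @ N.T

def implicit_solution(X, D, y, lam, B, s):
    """Evaluate the assumed implicit form with b = 0 for boundary set B, signs s."""
    m = D.shape[0]
    not_B = np.array([i for i in range(m) if i not in B], dtype=int)
    P = null_space_projector(D[not_B])
    shifted = y - lam * np.linalg.pinv(P @ X.T, rcond=1e-10) @ (D[B].T @ s)
    return np.linalg.pinv(X @ P, rcond=1e-10) @ shifted

X = np.eye(3)
D = np.diff(np.eye(3), axis=0)           # first difference matrix, 2 x 3
y = np.array([0.0, 0.0, 3.0])
lam = 1.0
B = [1]                                  # only the second difference is on the boundary
s = np.array([1.0])

beta_hat = implicit_solution(X, D, y, lam, B, s)
print(beta_hat)
```

The boundary set here is supplied by hand; finding it in general requires actually solving the problem, so this sketch only verifies the formula and is not a solver.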

Invariance of the linear space $X\mathrm{null}(D_{-B})$
Before diving into the technical details on conditions for uniqueness in the next section, we recall a key result from Tibshirani and Taylor (2012).
Lemma 3 (Lemma 10 in Tibshirani and Taylor, 2012). Fix any $X$, $D$, and $\lambda > 0$. There is a set $\mathcal{N} \subseteq \mathbb{R}^n$ of Lebesgue measure zero (that depends on $X$, $D$, $\lambda$), such that for $y \notin \mathcal{N}$, all boundary sets $B$ associated with optimal subgradients in the generalized lasso problem (1) give rise to the same subspace $X\mathrm{null}(D_{-B})$, i.e., there is a single linear subspace $L \subseteq \mathbb{R}^n$ such that $L = X\mathrm{null}(D_{-B})$ for all boundary sets $B$ of optimal subgradients. Moreover, for $y \notin \mathcal{N}$, $L = X\mathrm{null}(D_{-A})$ for all active sets $A$ associated with generalized lasso solutions.

A condition on certain linear independencies
We start by formalizing the discussion on uniqueness in the paragraphs following (9). As before, let $\lambda > 0$, and let $B$ denote the boundary set associated with an optimal subgradient in (1). Denote by $U(B) \in \mathbb{R}^{p \times k(B)}$ a matrix with linearly independent columns that span $\mathrm{null}(D_{-B})$. It is not hard to see that $\mathrm{null}(X) \cap \mathrm{null}(D_{-B}) = U(B)\,\mathrm{null}(XU(B))$, and thus
$$\mathrm{null}(X) \cap \mathrm{null}(D_{-B}) = \{0\} \iff \mathrm{rank}(XU(B)) = k(B).$$
Let us now assign such a basis matrix $U(B) \in \mathbb{R}^{p \times k(B)}$ to each boundary set $B$ corresponding to an optimal subgradient in (1). There is a unique generalized lasso solution, as given in (9) with $b = 0$, provided that the following two conditions hold:
$$\mathrm{rank}(XU(B)) = k(B) \;\; \text{for all boundary sets } B \text{ associated with optimal subgradients, and} \tag{10}$$
$$\mathrm{null}(D_{-B}) \;\; \text{is invariant across all boundary sets } B \text{ associated with optimal subgradients.} \tag{11}$$
To see this, note that if the space $\mathrm{null}(D_{-B})$ is invariant across all achieved boundary sets $B$ then so is the matrix $P_{\mathrm{null}(D_{-B})}$. This, and the fact that $P_{\mathrm{null}(D_{-B})} D_B^T s = P_{\mathrm{null}(D_{-B})} D^T \hat\gamma$, where $D^T\hat\gamma$ is unique from Lemma 2, ensures that the right-hand side in (9) with $b = 0$ agrees no matter the choice of boundary set and signs $B, s$.
Remark 3. For any subset $B \subseteq \{1, \ldots, m\}$, and any matrices $U(B), \tilde U(B) \in \mathbb{R}^{p \times k(B)}$ whose columns form a basis for $\mathrm{null}(D_{-B})$, it is easy to check that $\mathrm{rank}(XU(B)) = k(B) \iff \mathrm{rank}(X\tilde U(B)) = k(B)$. Therefore condition (10) is well-defined, i.e., it does not depend on the choice of basis matrix $U(B)$ associated with $\mathrm{null}(D_{-B})$ for each boundary set $B$.
We now show that, thanks to Lemma 3, condition (10) (almost everywhere) implies (11), so the former is alone sufficient for uniqueness.
Lemma 4. Fix any $X$, $D$, and $\lambda > 0$. For $y \notin \mathcal{N}$, where $\mathcal{N} \subseteq \mathbb{R}^n$ has Lebesgue measure zero as in Lemma 3, condition (10) implies (11). Hence, for almost every $y$, condition (10) is itself sufficient to imply uniqueness of the generalized lasso solution.
Proof. Let $y \notin \mathcal{N}$, and let $L$ be the linear subspace from Lemma 3, i.e., $L = X\mathrm{null}(D_{-B})$ for any boundary set $B$ associated with an optimal subgradient in the generalized lasso problem at $y$. Now fix a particular boundary set $B$ associated with an optimal subgradient and define the linear map $\mathcal{X} : \mathrm{null}(D_{-B}) \to L$ by $\mathcal{X}(u) = Xu$. By construction, this map is surjective. Moreover, assuming (10), it is injective, as for $a = U(B)\alpha$ and $b = U(B)\beta$ in $\mathrm{null}(D_{-B})$,
$$\mathcal{X}(a) = \mathcal{X}(b) \iff XU(B)(\alpha - \beta) = 0,$$
and the right-hand side cannot be true unless $a = b$. Therefore, $\mathcal{X}$ is bijective and has a linear inverse, and we may write $\mathrm{null}(D_{-B}) = \mathcal{X}^{-1}(L)$. As $B$ was arbitrary, this shows the invariance of $\mathrm{null}(D_{-B})$ over all proper choices of $B$, whenever $y \notin \mathcal{N}$.
Remark 4. If $D$ has full row rank, then by Lemma 2 the optimal subgradient $\hat\gamma$ is unique and so the boundary set $B$ is also unique. In this case, condition (11) is vacuous and condition (10) is sufficient for uniqueness of the generalized lasso solution for every $y$ (i.e., we do not need to rely on Lemma 4, which in turn uses Lemma 3, to prove that (10) is sufficient for almost every $y$).
From Lemma 4, we see that an (almost everywhere) sufficient condition for a unique solution in (1) is that the vectors $XU_i(B) \in \mathbb{R}^n$, $i = 1, \ldots, k(B)$ are linearly independent, for all instantiations of boundary sets $B$ of optimal subgradients. This may seem a little circular, to give a condition for uniqueness that itself is expressed in terms of the subgradients of solutions. But we will not stop at (10), and will derive more explicit conditions on $y$, $X$, $D$, and $\lambda > 0$ that imply (10) and therefore uniqueness of the solution in (1).
A first attempt is as follows: if we somehow knew that all boundary sets $B$ were "small", in the sense that $k(B) = \mathrm{nullity}(D_{-B}) < n$ for all boundary sets $B$ of optimal subgradients in (1), then $XU_i(B)$, $i = 1, \ldots, k(B)$ being linearly independent would be guaranteed, e.g., almost surely if the entries of $X$ were drawn from a continuous probability distribution. While it is conceivable that a restriction on $\lambda$ (i.e., a lower bound on $\lambda$) could be used to establish the condition $k(B) < n$ for all boundary sets $B$, we do not pursue this, and instead pursue a more general strategy with no such restrictions on $\lambda$. Next, we refine condition (10) in such a way that it always reduces to checking a "small" set of linear independencies, regardless of the sizes of the boundary sets.
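The rank condition discussed above is straightforward to test for a given candidate boundary set. The following is a minimal sketch, with hypothetical helper names, that builds an orthonormal basis matrix for $\mathrm{null}(D_{-B})$ via the SVD and checks whether $XU(B)$ has full column rank. The Gaussian $X$ and the two candidate sets are illustrative choices only.

```python
import numpy as np

def null_basis(A, tol=1e-10):
    """Matrix whose columns form an orthonormal basis of null(A)."""
    p = A.shape[1]
    if A.shape[0] == 0:
        return np.eye(p)
    _, sing, Vt = np.linalg.svd(A)       # full_matrices=True, so Vt is p x p
    rank = int(np.sum(sing > tol))
    return Vt[rank:].T

def rank_condition_holds(X, D, B):
    """Check rank(X U(B)) == k(B), where U(B) spans null(D with rows B removed)."""
    m = D.shape[0]
    not_B = np.array([i for i in range(m) if i not in B], dtype=int)
    U = null_basis(D[not_B])
    k = U.shape[1]
    if k == 0:
        return True                      # vacuous: the null space is trivial
    return bool(np.linalg.matrix_rank(X @ U) == k)

rng = np.random.default_rng(0)
n, p = 2, 5
X = rng.standard_normal((n, p))
D = np.diff(np.eye(p), axis=0)           # first difference matrix, 4 x 5

small_B = [1]                            # nullity of D_{-B} is 2, not larger than n
large_B = [0, 1, 2]                      # nullity of D_{-B} is 4, larger than n
print(rank_condition_holds(X, D, small_B), rank_condition_holds(X, D, large_B))
```

When the nullity of $D_{-B}$ exceeds $n$, the check must fail regardless of $X$, which is why a condition based on this rank alone cannot cover large boundary sets.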

A refined condition on linear independencies
The next lemma shows that when condition (10) fails, there is a specific type of linear dependence among the columns of XU (B), for a boundary set B. The proof is not difficult, but involves careful manipulations of the KKT conditions (2), and we defer it until the appendix.
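Lemma 5. Fix any $X$, $D$, and $\lambda > 0$. Let $y \notin \mathcal{N}$, the set of zero Lebesgue measure as in Lemma 3. Assume that $\mathrm{null}(X) \cap \mathrm{null}(D) = \{0\}$, and that the generalized lasso solution is not unique. Then there is a pair of boundary set and signs $B, s$ corresponding to an optimal subgradient in problem (1), such that for any matrix $U(B) \in \mathbb{R}^{p \times k(B)}$ whose columns form a basis for $\mathrm{null}(D_{-B})$, the following property holds of $Z = XU(B)$ and $\tilde s = U(B)^T D_B^T s$: there exist indices $i_1, \ldots, i_k \in \{1, \ldots, k(B)\}$ with $k \leq n + 1$ and $\tilde s_{i_1} \neq 0$, such that
$$Z_{i_2} \in \mathrm{span}(\{Z_{i_3}, \ldots, Z_{i_k}\}), \tag{12}$$
when $\tilde s_{i_2} = \cdots = \tilde s_{i_k} = 0$, and
$$Z_{i_1}/\tilde s_{i_1} \in \mathrm{aff}(\{Z_{i_j}/\tilde s_{i_j} : \tilde s_{i_j} \neq 0, \, j \geq 2\}) + \mathrm{span}(\{Z_{i_j} : \tilde s_{i_j} = 0\}), \tag{13}$$
when at least one of $\tilde s_{i_2}, \ldots, \tilde s_{i_k}$ is nonzero.
The spaces on the right-hand sides of both (12), (13) are of dimension at most $n - 1$. To see this, note that the span in (12) is generated by $k - 2 \leq n - 1$ vectors, while in (13) the affine span has dimension at most $|J| - 1$ and the linear span has dimension at most $k - 1 - |J|$, for a total of at most $k - 2 \leq n - 1$, where $J = \{j \in \{2, \ldots, k\} : \tilde s_{i_j} \neq 0\}$. Hence, because these spaces are at most $(n - 1)$-dimensional, neither condition (12) nor (13) should be "likely" under a continuous distribution for the predictor variables $X$. This is made precise in the next subsection.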
Before this, we define a deterministic condition on X that ensures special linear dependencies between the (transformed) columns, as in (12), (13), never hold.
Definition 1. Fix $D \in \mathbb{R}^{m \times p}$. We say that a matrix $X \in \mathbb{R}^{n \times p}$ is in $D$-general position (or $D$-GP) if the following property holds. For each subset $B \subseteq \{1, \ldots, m\}$ and sign vector $s \in \{-1, 1\}^{|B|}$, there is a matrix $U(B) \in \mathbb{R}^{p \times k(B)}$ whose columns form a basis for $\mathrm{null}(D_{-B})$, such that for $Z = XU(B)$, $\tilde s = U(B)^T D_B^T s$, and all $i_1, \ldots, i_k \in \{1, \ldots, k(B)\}$ with $\tilde s_{i_1} \neq 0$ and $k \leq n + 1$, it holds that
(i) $Z_{i_2} \notin \mathrm{span}(\{Z_{i_3}, \ldots, Z_{i_k}\})$, when $\tilde s_{i_2} = \cdots = \tilde s_{i_k} = 0$;
(ii) $Z_{i_1}/\tilde s_{i_1} \notin \mathrm{aff}(\{Z_{i_j}/\tilde s_{i_j} : \tilde s_{i_j} \neq 0, \, j \geq 2\}) + \mathrm{span}(\{Z_{i_j} : \tilde s_{i_j} = 0\})$, when at least one of $\tilde s_{i_2}, \ldots, \tilde s_{i_k}$ is nonzero.
Remark 5. Though the definition may appear somewhat complicated, a matrix $X$ being in $D$-GP is actually quite a weak condition, and can hold regardless of the (relative) sizes of $n$, $p$. We will show in the next subsection that it holds almost surely under an arbitrary continuous probability distribution for the entries of $X$. Further, when $X = I$, the above definition essentially reduces to the usual notion of general position (refer to, e.g., Tibshirani, 2013 for this definition).
When $X$ is in $D$-GP, we have (by definition) that (12), (13) cannot hold for any $B \subseteq \{1, \ldots, m\}$ and $s \in \{-1, 1\}^{|B|}$ (not just boundary sets and signs); therefore, by the contrapositive of Lemma 5, if we additionally have $y \notin \mathcal{N}$ and $\mathrm{null}(X) \cap \mathrm{null}(D) = \{0\}$, then the generalized lasso solution must be unique. To emphasize this, we state it as a lemma.
Lemma 6. Fix any $X$, $D$, and $\lambda > 0$. If $y \notin \mathcal{N}$, the set of zero Lebesgue measure as in Lemma 3, $\mathrm{null}(X) \cap \mathrm{null}(D) = \{0\}$, and $X$ is in $D$-GP, then the generalized lasso solution is unique.

Absolutely continuous predictor variables
We give an important result that shows the D-GP condition is met almost surely for continuously distributed predictors. There are no restrictions on the relative sizes of n, p. The proof of the next result uses elementary probability arguments and is deferred until the appendix.
Lemma 7. Fix $D \in \mathbb{R}^{m \times p}$, and assume that the entries of $X \in \mathbb{R}^{n \times p}$ are drawn from a distribution that is absolutely continuous with respect to $(np)$-dimensional Lebesgue measure. Then $X$ is in $D$-GP almost surely.
We now present a result showing that the base condition null(X) ∩ null(D) = {0} is met almost surely for continuously distributed predictors, provided that p ≤ n, or p > n and the null space of D is not too large. Its proof is elementary and found in the appendix.
Lemma 8. Fix $D \in \mathbb{R}^{m \times p}$, and assume that the entries of $X \in \mathbb{R}^{n \times p}$ are drawn from a distribution that is absolutely continuous with respect to $(np)$-dimensional Lebesgue measure. If either $p \leq n$, or $p > n$ and $\mathrm{nullity}(D) \leq n$, then $\mathrm{null}(X) \cap \mathrm{null}(D) = \{0\}$ almost surely.
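A quick simulation can illustrate the null space condition in the high-dimensional case. This sketch uses the identity $\dim(\mathrm{null}(X) \cap \mathrm{null}(D)) = p - \mathrm{rank}$ of the stacked matrix $[X; D]$; the Gaussian draws, the seed, and the function name `intersection_dim` are assumptions made purely for illustration.

```python
import numpy as np

def intersection_dim(X, D):
    """dim(null(X) ∩ null(D)), computed as p minus the rank of X stacked on D."""
    stacked = np.vstack([X, D])
    return X.shape[1] - int(np.linalg.matrix_rank(stacked))

rng = np.random.default_rng(0)
p = 10
D = np.diff(np.eye(p), n=2, axis=0)       # second difference matrix, nullity 2

X_ok = rng.standard_normal((3, p))        # n = 3, so nullity(D) <= n
X_small = rng.standard_normal((1, p))     # n = 1, so nullity(D) > n

print(intersection_dim(X_ok, D), intersection_dim(X_small, D))
```

With $n = 1$ the stacked matrix has at most $p - 1$ linearly independent rows, so the intersection is necessarily nontrivial, whatever the draw of $X$.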
Putting together Lemmas 6, 7, 8 gives our main result on the uniqueness of the generalized lasso solution.
Theorem 1. Fix any D and λ > 0. Assume the joint distribution of (X, y) is absolutely continuous with respect to (np + n)-dimensional Lebesgue measure. If p ≤ n, or else p > n and nullity(D) ≤ n, then the solution in the generalized lasso problem (1) is unique almost surely.
Remark 6. If $D$ has full row rank, then as discussed in Remark 4, the optimal subgradient $\hat\gamma$ in (1) is unique and hence so is its boundary set $B$, making condition (10) itself sufficient for uniqueness everywhere in $y$ (not almost everywhere in $y$). Thus, in this case, the condition in Theorem 1 that $y|X$ has an absolutely continuous distribution is not needed, and (with the other conditions in place) uniqueness holds for every $y$, almost surely over $X$. Under this (slight) sharpening, we can see that for $D = I$, Theorem 1 subsumes the lasso uniqueness result in Lemma 4 of Tibshirani (2013).
Remark 7. Generally speaking, the condition that $\mathrm{nullity}(D) \leq n$ in Theorem 1 (assumed in the case $p > n$) is not strong. In many applications of the generalized lasso, the dimension of the null space of $D$ is small and fixed (i.e., it does not grow with $n$). For example, recall Corollary 1, where the lower bound on $n$ in each of the cases reflects the dimension of the null space.

Standardized predictor variables
A common preprocessing step, in many applications of penalized modeling such as the generalized lasso, is to standardize the predictors $X \in \mathbb{R}^{n \times p}$, meaning, center each column to have mean 0, and then scale each column to have norm 1. Here we show that our main uniqueness results carry over, mutatis mutandis, to the case of standardized predictor variables.
We begin by studying the case of centering alone. Let $M = I - \mathbf{1}\mathbf{1}^T/n \in \mathbb{R}^{n \times n}$ denote the centering map, and consider the centered generalized lasso problem
$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2}\|y - MX\beta\|_2^2 + \lambda \|D\beta\|_1. \tag{14}$$
We have the following uniqueness result for centered predictors.
Corollary 2. Fix any D and λ > 0. Assume the distribution of (X, y) is absolutely continuous with respect to (np + n)-dimensional Lebesgue measure. If p ≤ n − 1, or p > n − 1 and nullity(D) ≤ n − 1, then the solution in the centered generalized lasso problem (14) is unique almost surely.
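Proof. Let $V = [\, V_1 \;\; V_{-1} \,] \in \mathbb{R}^{n \times n}$ be an orthogonal matrix, where $V_1 = \mathbf{1}/\sqrt{n} \in \mathbb{R}^{n \times 1}$ and $V_{-1} \in \mathbb{R}^{n \times (n-1)}$ has columns that span $\mathrm{col}(M)$. Note that the centered generalized lasso criterion in (14) can be written as
$$\frac{1}{2}\|y - MX\beta\|_2^2 + \lambda \|D\beta\|_1 = \frac{1}{2}(V_1^T y)^2 + \frac{1}{2}\|V_{-1}^T y - V_{-1}^T X\beta\|_2^2 + \lambda \|D\beta\|_1,$$
hence problem (14) is equivalent to a regular (uncentered) generalized lasso problem with response $V_{-1}^T y \in \mathbb{R}^{n-1}$ and predictor matrix $V_{-1}^T X \in \mathbb{R}^{(n-1) \times p}$. By straightforward arguments (using integration and change of variables), $(X, y)$ having a density on $\mathbb{R}^{np+n}$ implies that $(V_{-1}^T X, V_{-1}^T y)$ has a density on $\mathbb{R}^{(n-1)p+(n-1)}$. Thus, we can apply Theorem 1 to the generalized lasso problem with response $V_{-1}^T y$ and predictor matrix $V_{-1}^T X$ to give the desired result.
Remark 8. The exact same result as stated in Corollary 2 holds for the generalized lasso problem with intercept
$$\underset{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2}\|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|D\beta\|_1. \tag{15}$$
This is because, by minimizing over $\beta_0$ in problem (15), we find that this problem is equivalent to minimization of $\frac{1}{2}\|My - MX\beta\|_2^2 + \lambda \|D\beta\|_1$ over $\beta$, i.e., equivalent to a generalized lasso problem with response $V_{-1}^T y$ and predictors $V_{-1}^T X$, just as in the proof of Corollary 2.
Next we treat the case of scaling alone. Let $W_X = \mathrm{diag}(\|X_1\|_2, \ldots, \|X_p\|_2) \in \mathbb{R}^{p \times p}$, and consider the scaled generalized lasso problem
$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2}\|y - XW_X^{-1}\beta\|_2^2 + \lambda \|D\beta\|_1. \tag{16}$$
We give a helper lemma, on the distribution of a continuous random vector, post scaling. Its proof is deferred until the appendix.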
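The reduction used in this proof can be checked numerically. The sketch below assumes the centering map is $M = I - \mathbf{1}\mathbf{1}^T/n$ and builds an orthogonal matrix whose first column is proportional to the all-ones vector via a QR factorization (one convenient construction among many); it then compares the centered squared loss with its decomposition in the rotated coordinates. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

M = np.eye(n) - np.ones((n, n)) / n       # centering map

# Orthogonal V = [V1, V_rest]: QR of [1, e_1, ..., e_{n-1}] puts the normalized
# all-ones vector (up to sign) in the first column.
Q, _ = np.linalg.qr(np.column_stack([np.ones(n), np.eye(n)[:, : n - 1]]))
V1, V_rest = Q[:, 0], Q[:, 1:]

lhs = np.sum((y - M @ X @ beta) ** 2)
rhs = (V1 @ y) ** 2 + np.sum((V_rest.T @ y - V_rest.T @ X @ beta) ** 2)
print(np.isclose(lhs, rhs), np.allclose(V_rest @ V_rest.T, M))
```

The term $(V_1^T y)^2$ does not depend on $\beta$, which is why the centered problem behaves like an uncentered problem with $n - 1$ observations.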
Lemma 9. Let $Z \in \mathbb{R}^n$ be a random vector whose distribution is absolutely continuous with respect to $n$-dimensional Lebesgue measure. Then, the distribution of $Z/\|Z\|_2$ is absolutely continuous with respect to $(n-1)$-dimensional Hausdorff measure restricted to the $(n-1)$-dimensional unit sphere, $\mathbb{S}^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$.
We give a second helper lemma, on the $(n-1)$-dimensional Hausdorff measure of an affine space intersected with the unit sphere $\mathbb{S}^{n-1}$. This is important because checking that the scaled predictor matrix is in $D$-GP can be reduced to checking that none of its columns lie in a finite union of affine spaces. The proof of the lemma is deferred until the appendix.
We present a third helper lemma, which establishes that for absolutely continuous $X$, the scaled predictor matrix $XW_X^{-1}$ is in $D$-GP and satisfies the appropriate null space condition, almost surely. Its proof is again deferred until the appendix.
Lemma 11. Fix $D \in \mathbb{R}^{m \times p}$, and assume that $X \in \mathbb{R}^{n \times p}$ has entries drawn from a distribution that is absolutely continuous with respect to $(np)$-dimensional Lebesgue measure. Then $XW_X^{-1}$ is in $D$-GP almost surely. Moreover, if $p \leq n$, or $p > n$ and $\mathrm{nullity}(D) \leq n$, then $\mathrm{null}(XW_X^{-1}) \cap \mathrm{null}(D) = \{0\}$ almost surely.
Combining Lemmas 6, 11 gives the following uniqueness result for scaled predictors.
Corollary 3. Fix any D and λ > 0. Assume the distribution of (X, y) is absolutely continuous with respect to (np + n)-dimensional Lebesgue measure. If p ≤ n, or else p > n and nullity(D) ≤ n, then the solution in the scaled generalized lasso problem (16) is unique almost surely.
Finally, we consider the standardized generalized lasso problem,
$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2}\|y - MXW_{MX}^{-1}\beta\|_2^2 + \lambda\|D\beta\|_1, \quad (17)$$
where, note, the predictor matrix $MXW_{MX}^{-1}$ has standardized columns, i.e., each column has been centered to have mean 0, then scaled to have norm 1. We have the following uniqueness result for standardized predictors.
Corollary 4. Fix any D and λ > 0. Assume the distribution of (X, y) is absolutely continuous with respect to (np + n)-dimensional Lebesgue measure. If p ≤ n − 1, or p > n − 1 and nullity(D) ≤ n − 1, then the solution in the standardized generalized lasso problem (17) is unique almost surely.
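Proof. Let $V = [\, V_1 \;\; V_{-1} \,] \in \mathbb{R}^{n \times n}$ be as in the proof of Corollary 2, and rewrite the criterion in (17) as
$$\frac{1}{2}\|y - MXW_{MX}^{-1}\beta\|_2^2 + \lambda\|D\beta\|_1 = \frac{1}{2}\|V_1^T y\|_2^2 + \frac{1}{2}\|V_{-1}^T y - V_{-1}^T XW_{MX}^{-1}\beta\|_2^2 + \lambda\|D\beta\|_1.$$
Since $\|MX_i\|_2 = \|V_{-1}^T X_i\|_2$ for each $i = 1, \ldots, p$, we have $W_{MX} = W_{V_{-1}^T X}$, so that
$$V_{-1}^T XW_{MX}^{-1} = V_{-1}^T XW_{V_{-1}^T X}^{-1}$$
is precisely the scaled version of $V_{-1}^T X$. From the second to last display, we see that the standardized generalized lasso problem (17) is the same as a scaled generalized lasso problem with response $V_{-1}^T y$ and scaled predictor matrix $V_{-1}^T XW_{V_{-1}^T X}^{-1}$. Under the conditions placed on $y, X$, as explained in the proof of Corollary 2, the distribution of $(V_{-1}^T X, V_{-1}^T y)$ is absolutely continuous. Therefore we can apply Corollary 3 to give the result.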
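As a small numerical companion to the standardized case, the following sketch (random data and sizes are assumptions for illustration) forms $MXW_{MX}^{-1}$ and checks the two facts the argument relies on: the standardized columns have mean 0 and norm 1, and the column norms of $MX$ coincide with those of $V_{-1}^T X$, so scaling commutes with the reduction to $n - 1$ rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 8                                  # illustrative sizes
X = rng.standard_normal((n, p))

M = np.eye(n) - np.ones((n, n)) / n          # centering projection
MX = M @ X
col_norms = np.linalg.norm(MX, axis=0)       # diagonal entries of W_{MX}
X_std = MX / col_norms                       # M X W_{MX}^{-1}, column-wise scaling

# Orthonormal basis V_minus of col(M), built by QR (an illustrative choice).
Q, _ = np.linalg.qr(np.column_stack([np.ones(n), np.eye(n)[:, : n - 1]]))
V_minus = Q[:, 1:]
reduced_norms = np.linalg.norm(V_minus.T @ X, axis=0)

print(np.allclose(col_norms, reduced_norms))
```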

Generalized lasso with a general loss
We now extend some of the preceding results beyond the case of squared error loss, as considered previously. In particular, we consider the problem
$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \;\; G(X\beta; y) + \lambda\|D\beta\|_1, \quad (18)$$
where we assume, for each $y \in \mathbb{R}^n$, that the function $G(\,\cdot\,; y)$ is essentially smooth and essentially strictly convex on $\mathbb{R}^n$. These two conditions together mean that $G(\,\cdot\,; y)$ is a closed proper convex function, differentiable and strictly convex on the interior of its domain (assumed to be nonempty), with the norm of its gradient approaching $\infty$ along any sequence approaching the boundary of its domain. A function that is essentially smooth and essentially strictly convex is also called, according to some authors, of Legendre type; see Chapter 26 of Rockafellar (1970). An important special case of a Legendre function is one that is differentiable and strictly convex, with full domain (all of $\mathbb{R}^n$).
For much of what follows, we will focus on loss functions of the form
$$G(z; y) = -y^T z + \psi(z), \quad (19)$$
for an essentially smooth and essentially strictly convex function $\psi$ on $\mathbb{R}^n$ (not depending on $y$). This is a weak restriction on $G$ and encompasses, e.g., the cases in which $G$ is the negative log-likelihood function arising from a generalized linear model for the entries of $y|X$ with canonical link function, where $\psi$ is the cumulant generating function. In the case of, say, a Bernoulli or Poisson model, the loss is
$$G(z; y) = -y^T z + \sum_{i=1}^n \log(1 + e^{z_i}) \quad \text{or} \quad G(z; y) = -y^T z + \sum_{i=1}^n e^{z_i},$$
respectively. For brevity, we will often write the loss function as $G(X\beta)$, hiding the dependence on the response vector $y$.
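To make the generalized linear model form concrete, here is a minimal sketch of the loss $G(z; y) = -y^T z + \psi(z)$ for the Bernoulli and Poisson cumulant generating functions. The function names are hypothetical, chosen for this illustration only.

```python
import numpy as np

def psi_bernoulli(z):
    # Bernoulli cumulant generating function: sum_i log(1 + exp(z_i)),
    # computed stably via logaddexp(0, z).
    return np.sum(np.logaddexp(0.0, z))

def psi_poisson(z):
    # Poisson cumulant generating function: sum_i exp(z_i).
    return np.sum(np.exp(z))

def glm_loss(z, y, psi):
    # Loss of the form -y^T z + psi(z), evaluated at a natural parameter z.
    return -y @ z + psi(z)

y_binary = np.array([1.0, 0.0, 1.0])
y_counts = np.array([2.0, 0.0, 5.0])
z0 = np.zeros(3)

bernoulli_at_zero = glm_loss(z0, y_binary, psi_bernoulli)   # equals 3 * log(2)
poisson_at_zero = glm_loss(z0, y_counts, psi_poisson)       # equals 3.0
```

In problem (18) the argument `z` would be the linear predictor `X @ beta`; the penalty term is omitted here since only the loss is being illustrated.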

Basic facts, KKT conditions, and the dual
The next lemma follows from arguments identical to those for Lemma 1.
Lemma 12. For any y, X, D, λ ≥ 0, and for G essentially smooth and essentially strictly convex, the following holds of problem (18).
(i) There is either zero, one, or uncountably many solutions.
(ii) Every solution $\hat\beta$ gives rise to the same fitted value $X\hat\beta$.
Note the difference between Lemmas 12 and 1, part (i): for an arbitrary (essentially smooth and essentially strictly convex) G, the criterion in (18) need not attain its infimum, whereas the criterion in (1) always does. This happens because the criterion in (18) can have directions of strict recession (i.e., directions of recession in which the criterion is not constant), whereas the criterion in (1) cannot. Thus in general, problem (18) need not have a solution; this is true even in the most fundamental cases of interest beyond squared loss, e.g., the case of a Bernoulli negative log-likelihood G. Later in Lemma 14, we give a sufficient condition for the existence of solutions in (18).
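The KKT conditions for problem (18) are
$$-X^T \nabla G(X\hat\beta; y) = \lambda D^T \hat\gamma, \quad (20)$$
where $\hat\gamma \in \mathbb{R}^m$ is (as before) a subgradient of the $\ell_1$ norm evaluated at $D\hat\beta$,
$$\hat\gamma_i \in \begin{cases} \{\mathrm{sign}((D\hat\beta)_i)\} & \text{if } (D\hat\beta)_i \neq 0 \\ [-1, 1] & \text{if } (D\hat\beta)_i = 0, \end{cases} \quad \text{for } i = 1, \ldots, m. \quad (21)$$
As in the squared loss case, uniqueness of $X\hat\beta$ by Lemma 12, along with (20), implies the next result.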
Lemma 13. For any y, X, D, λ > 0, and G essentially smooth and essentially strictly convex, every optimal subgradientγ in problem (18) gives rise to the same value of D Tγ . Furthermore, when D has full row rank, the optimal subgradientγ is unique, assuming that problem (18) has a solution in the first place.
Denote by $G^*$ the conjugate function of $G$. When $G$ is essentially smooth and essentially strictly convex, the following facts hold (e.g., see Theorem 26.5 of Rockafellar 1970):
• its conjugate $G^*$ is also essentially smooth and essentially strictly convex; and
• the map $\nabla G : \mathrm{int}(\mathrm{dom}(G)) \to \mathrm{int}(\mathrm{dom}(G^*))$ is a homeomorphism with inverse $(\nabla G)^{-1} = \nabla G^*$.
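The inverse relationship between a Legendre function's gradient and the gradient of its conjugate can be seen in a familiar special case. The sketch below uses the Bernoulli cumulant generating function $\psi(z) = \sum_i \log(1 + e^{z_i})$ as an illustrative stand-in (an assumption of the example): its gradient is the logistic sigmoid, mapping $\mathbb{R}^n$ onto $(0,1)^n$, and the gradient of its conjugate is the logit map.

```python
import numpy as np

def grad_psi(z):
    # Gradient of psi(z) = sum_i log(1 + exp(z_i)): the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def grad_psi_conj(w):
    # Gradient of the conjugate psi*(w) = sum_i [w_i log w_i + (1 - w_i) log(1 - w_i)],
    # defined on the open cube (0, 1)^n: the logit map.
    return np.log(w) - np.log1p(-w)

z = np.linspace(-3.0, 3.0, 7)
w = grad_psi(z)                  # lies in int(dom(psi*)) = (0, 1)^n
z_back = grad_psi_conj(w)        # recovers z up to floating-point error
print(np.allclose(z_back, z))
```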
The conjugate function is intrinsically tied to duality, directions of recession, and the existence of solutions. Standard arguments in convex analysis, deferred to the appendix, give the next result.
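Lastly, existence of primal and dual solutions is guaranteed under the conditions
$$0 \in \mathrm{int}(\mathrm{dom}(G)), \quad (24)$$
$$(-C) \cap \mathrm{int}(\mathrm{ran}(\nabla G)) \neq \emptyset. \quad (25)$$
In particular, under (24), a solution exists in the dual problem (22), and under (24), (25), a solution exists in the primal problem (18).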
Assuming that primal and dual solutions exist, we see from (23) in the above lemma that $\hat v$ must be unique (by uniqueness of $X\hat\beta$, from Lemma 12), but $\hat u$ need not be (as $\hat\gamma$ is not necessarily unique). Moreover, under condition (24), we know that $G$ is differentiable at 0, and $\nabla G^*(\nabla G(0)) = 0$, hence we may rewrite (22) as
$$\underset{u \in \mathbb{R}^m,\, v \in \mathbb{R}^n}{\text{minimize}} \;\; D_{G^*}(-v, \nabla G(0)) \quad \text{subject to} \quad X^T v = D^T u, \;\; \|u\|_\infty \leq \lambda, \quad (26)$$
where $D_f(x, z) = f(x) - f(z) - \langle \nabla f(z), x - z \rangle$ denotes the Bregman divergence between points $x, z$, with respect to a function $f$. Optimality of $\hat v$ in (26) may be expressed as
$$\hat v = -P^{G^*}_{-C}(\nabla G(0)), \quad \text{where} \;\; C = (X^T)^{-1}\big(D^T B^m_\infty(\lambda)\big). \quad (27)$$
Here, recall $(X^T)^{-1}(S)$ denotes the preimage of a set $S$ under the linear map $X^T$, $D^T S$ denotes the image of a set $S$ under the linear map $D^T$, $B^m_\infty(\lambda) = \{u \in \mathbb{R}^m : \|u\|_\infty \leq \lambda\}$ is the $\ell_\infty$ ball of radius $\lambda$ in $\mathbb{R}^m$, and now $P^f_S(\cdot)$ is the projection operator onto a set $S$ with respect to the Bregman divergence of a function $f$, i.e., $P^f_S(z) = \arg\min_{x \in S} D_f(x, z)$. From (27) and (23), we see that
$$X\hat\beta = \nabla G^*\big(P^{G^*}_{-C}(\nabla G(0))\big). \quad (28)$$
We note the analogy between (27), (28) and (6), (7) in the squared loss case; for $G(z) = \frac{1}{2}\|y - z\|_2^2$, we have $\nabla G(0) = -y$, $G^*(z) = \frac{1}{2}\|y + z\|_2^2 - \frac{1}{2}\|y\|_2^2$, $\nabla G^*(z) = y + z$, $-P^{G^*}_{-C}(\nabla G(0)) = P_C(y)$, and so (27), (28) match (6), (7), respectively. But when $G$ is non-quadratic, we see that the dual solution $\hat v$ and primal fit $X\hat\beta$ are given in terms of a non-Euclidean projection operator, defined with respect to the Bregman divergence of $G^*$. See Figure 2 for an illustration. This complicates the study of the primal and dual problems, in comparison to the squared loss case; still, as we will show in the coming subsections, several key properties of primal and dual solutions carry over to the current general loss setting.
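The squared loss identities quoted above can be verified directly. In the sketch below (the helper `bregman` and the random test vectors are illustrative assumptions), the Bregman divergence of $G^*$ between $-v$ and $\nabla G(0) = -y$ reduces to $\frac{1}{2}\|y - v\|_2^2$, which is why the dual projection becomes an ordinary Euclidean projection of $y$ in that case.

```python
import numpy as np

def bregman(f, grad_f, x, z):
    # Bregman divergence D_f(x, z) = f(x) - f(z) - <grad f(z), x - z>.
    return f(x) - f(z) - grad_f(z) @ (x - z)

rng = np.random.default_rng(2)
y = rng.standard_normal(4)
v = rng.standard_normal(4)       # an arbitrary dual point

def G_conj(z):
    # Conjugate of G(z) = 0.5 * ||y - z||^2.
    return 0.5 * np.sum((y + z) ** 2) - 0.5 * np.sum(y ** 2)

def grad_G_conj(z):
    return y + z

grad_G_at_zero = -y              # gradient of G at 0
divergence = bregman(G_conj, grad_G_conj, -v, grad_G_at_zero)
euclidean = 0.5 * np.sum((y - v) ** 2)
print(np.isclose(divergence, euclidean))
```

For a non-quadratic loss the divergence is no longer a squared Euclidean distance, which is the source of the extra difficulty described in the text.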

Implicit form of solutions
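Fix an arbitrary $\lambda > 0$, and let $(\hat\beta, \hat\gamma)$ denote an optimal solution-subgradient pair, i.e., satisfying (20), (21). As before, we define the boundary set and boundary signs in terms of $\hat\gamma$, $B = \{i \in \{1, \ldots, m\} : |\hat\gamma_i| = 1\}$, and $s = \mathrm{sign}(\hat\gamma_B)$. By (21), we have that $A \subseteq B$, where $A = \mathrm{supp}(D\hat\beta)$ and $r = \mathrm{sign}(D_A\hat\beta)$ denote the active set and active signs of $\hat\beta$. In general, $A, r, B, s$ are not unique, as neither $\hat\beta$ nor $\hat\gamma$ is. As in the squared loss case, we can derive an implicit form for the fitted value and solution in terms of $B, s$.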
The next lemma gives an implicit form for the fit and solutions in (18), with G as in (19), akin to the results (8), (9) in the squared loss case. Its proof stems directly from the KKT conditions (29); it is somewhat technical and deferred until the appendix.
Lemma 15. Fix any $y, X, D$, and $\lambda > 0$. Assume that $G$ is of the form (19), where $\psi$ is essentially smooth and essentially strictly convex, and satisfies (31), (32). Let $\hat\beta$ be a solution in problem (18), and let $\hat\gamma$ be a corresponding optimal subgradient, with boundary set and boundary signs $B, s$. Define the affine subspace
$$K_{B,s} = \lambda\big(P_{\mathrm{null}(D_{-B})}X^T\big)^+ D_B^T s + \mathrm{null}\big(P_{\mathrm{null}(D_{-B})}X^T\big). \quad (33)$$
Then the unique fit can be expressed as
$$X\hat\beta = \nabla\psi^*\Big(P^{\psi^*}_{y - K_{B,s}}\big(\nabla\psi(0)\big)\Big), \quad (34)$$
and the solution can be expressed as
$$\hat\beta = \big(XP_{\mathrm{null}(D_{-B})}\big)^+ \nabla\psi^*\Big(P^{\psi^*}_{y - K_{B,s}}\big(\nabla\psi(0)\big)\Big) + b, \quad (35)$$
for some $b \in \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$. Similarly, letting $A, r$ denote the active set and active signs of $\hat\beta$, the same expressions hold as in the last two displays with $B, s$ replaced by $A, r$ (i.e., with the affine subspace of interest now being $K_{A,r} = \lambda(P_{\mathrm{null}(D_{-A})}X^T)^+ D_A^T r + \mathrm{null}(P_{\mathrm{null}(D_{-A})}X^T)$).
Remark 9. The proof of Lemma 15 derives the representation (34) using technical manipulation of the KKT conditions. But the same result can be derived using the geometry surrounding the dual problem (26). See Figure 2 for an accompanying illustration, and Remark 2 for a similar geometric argument in the squared loss case. As $\hat\gamma$ has boundary set and signs $B, s$, and $\hat u = \lambda\hat\gamma$ from (23), we see that $\hat u$ must lie on the face of $B^m_\infty(\lambda)$ whose affine span is $E_{B,s} = \{u \in \mathbb{R}^m : u_B = \lambda s\}$; and as $X^T\hat v = D^T\hat u$, we see that $\hat v$ lies on the face of $C$ whose affine span is $K_{B,s} = (X^T)^{-1}(D^T E_{B,s})$, which, it can be checked, can be rewritten explicitly as the affine subspace in (33). Hence, the projection of $\nabla G(0)$ onto $-C$ lies on a face whose affine span is $-K_{B,s}$, and we can write
$$P^{G^*}_{-C}\big(\nabla G(0)\big) = P^{G^*}_{-K_{B,s}}\big(\nabla G(0)\big),$$
i.e., we can simply replace the set $-C$ in (27) with $-K_{B,s}$. When $G$ is of the form (19), repeating the same arguments as before therefore shows that the dual and primal projections in (30) hold with $-C$ replaced by $-K_{B,s}$, which yields the primal projection result in (34) in the lemma.
Though the form of solutions in (35) appears more complicated than the form (9) in the squared loss case, we see that one important property has carried over to the general loss setting, namely, the property that b ∈ null(X) ∩ null(D −B ). As before, let us assign to each boundary set B associated with an optimal subgradient in (18) a basis matrix U (B) ∈ R p×k(B) , whose linearly independent columns span null(D −B ). Then by the same logic as explained at the beginning of Section 3.1, we see that, under the conditions of Lemma 15, there is a unique solution in (18), given by (35) with b = 0, provided that conditions (10), (11) hold.
The arguments in the squared loss case, following the observation of (10), (11) as a sufficient condition, relied on the invariance of the linear subspace Xnull(D −B ) over all boundary sets B of optimal subgradients in the generalized lasso problem (1). This key result was established, recall, in Lemma 10 of Tibshirani and Taylor (2012), transcribed in our Lemma 3 for convenience. For the general loss setting, no such invariance result exists (as far as we know). Thus, with uniqueness in mind as the end goal, we take somewhat of a detour and study local properties of generalized lasso solutions, and invariance of the relevant linear subspaces, over the next two subsections.

Local stability
We establish a result on the local stability of the boundary set and boundary signs B, s associated with an optimal solution-subgradient pair (β,γ), i.e., satisfying (20), (21). This is a generalization of Lemma 9 in Tibshirani and Taylor (2012), which gives the analogous result for the case of squared loss. We must first introduce some notation. For arbitrary subsets A ⊆ B ⊆ {1, . . . , m}, denote  (ψ))).
Next we present the local stability result. Its proof is lengthy and deferred until the appendix.
Lemma 16. Fix any X, D, and λ > 0. Fix y ∉ N, where the set N is defined in (37). Assume that G is of the form (19), where ψ is essentially smooth and essentially strictly convex, satisfying (31), (32). That is, our assumptions on the response are succinctly: y ∈ N^c ∩ (int(ran(∇ψ)) + C). Denote an optimal solution-subgradient pair in problem (18) by (β̂(y), γ̂(y)), our notation here emphasizing the dependence on y, and similarly, denote the associated boundary set, boundary signs, active set, and active signs by B(y), s(y), A(y), r(y), respectively. Then there is a neighborhood U of y such that, for any y′ ∈ U, problem (18) has a solution, and in particular, has an optimal solution-subgradient pair (β̂(y′), γ̂(y′)) with the same boundary set B(y′) = B(y), boundary signs s(y′) = s(y), active set A(y′) = A(y), and active signs r(y′) = r(y).
Remark 10. The set N defined in (37) is bigger than it needs to be; to be precise, the same result as in Lemma 16 actually holds with N replaced by the smaller set which can be seen from the proof of Lemma 16, as can be N * ⊆ N . However, the definition of N in (37) is more explicit than that of N * in (38), so we stick with the former set for simplicity.
Remark 11. For each triplet A, B, s in the definition (37) over which the union is defined, the sets K B,s and col(XP null(D −B ) ) ∩ null(M A,B ) both have Lebesgue measure zero, as they are affine spaces of dimension at most n − 1. When ∇ψ : int(dom(ψ)) → int(dom(ψ * )) is a C 1 diffeomorphism, which holds true when ψ is the cumulant generating function for the Bernoulli or Poisson cases, the image ∇ψ(col(XP null(D −B ) ) ∩ null(M A,B )) also has Lebesgue measure zero, for each triplet A, B, s, and thus N (being a finite union of measure zero sets) has measure zero.

Invariance of the linear space Xnull(D −B )
We leverage the local stability result from the last subsection to establish an invariance of the linear subspace Xnull(D −B ) over all choices of boundary sets B corresponding to an optimal subgradient in (18). This is a generalization of Lemma 10 in Tibshirani and Taylor (2012), which was transcribed in our Lemma 3. The proof is again deferred until the appendix.
Lemma 17. Assume the conditions of Lemma 16. Then all boundary sets B associated with optimal subgradients in problem (18) give rise to the same subspace Xnull(D −B ), i.e., there is a single linear subspace L ⊆ R n such that L = Xnull(D −B ) for all boundary sets B of optimal subgradients. Further, L = Xnull(D −A ) for all active sets A associated with solutions in (18).
As already mentioned, Lemmas 16 and 17 extend Lemmas 9 and 10, respectively, of Tibshirani and Taylor (2012) to the case of a general loss function G, taking the generalized linear model form in (19). This represents a significant advance in our understanding of the local nature of generalized lasso solutions outside of the squared loss case. For example, even for the special case D = I, that logistic lasso solutions have locally constant active sets, and that col(X A ) is invariant to all choices of active set A, provided y is not in an "exceptional set" N , seem to be interesting and important findings. These results could be helpful, e.g., in characterizing the divergence, with respect to y, of the generalized lasso fit in (34), an idea that we leave to future work.

Sufficient conditions for uniqueness
We are now able to build on the invariance result in Lemma 17, just as we did in the squared loss case, to derive our main result on uniqueness in the current general loss setting.
Theorem 2. Fix any X, D, and λ > 0. Assume that G is of the form (19), where ψ is essentially smooth and essentially strictly convex, and satisfies (31). Assume: (a) null(X) ∩ null(D) = {0}, and X is in D-GP; or (b) the entries of X are drawn from a distribution that is absolutely continuous on R np , and p ≤ n; or (c) the entries of X are drawn from a distribution that is absolutely continuous on R np , p > n, and nullity(D) ≤ n.
In case (a), the next statement (the conclusion) holds deterministically; in cases (b) or (c), it holds almost surely with respect to the distribution of X. For any y ∈ N^c ∩ (int(ran(∇ψ)) + C), where the set N is defined in (37), problem (18) has a unique solution.
Proof. Under the conditions of the theorem, Lemma 15 shows that any solution in (18) must take the form (35). As in the arguments in Section 3.1, in the squared loss case, we see that (10), (11) are together sufficient for implying uniqueness of the solution in (18). Moreover, Lemma 17 implies the linear subspace L = Xnull(D −B ) is invariant under all choices of boundary sets B corresponding to optimal subgradients in (18); as in the proof of Lemma 4 in the squared loss case, such invariance implies that (10) is by itself a sufficient condition. Finally, if (10) does not hold, then X cannot be in D-GP, which follows by applying the arguments of Lemma 5 in the squared loss case to the KKT conditions (29). This completes the proof under condition (a). Recall, conditions (b) or (c) simply imply (a) by Lemmas 7 and 8.
As explained in Remark 11, the set N in (37) has Lebesgue measure zero for G as in (19), when ∇ψ is a C 1 diffeomorphism, which is true, e.g., for ψ the Bernoulli or Poisson cumulant generating function. However, in the case that ψ is the Bernoulli cumulant generating function, and G is the associated negative log-likelihood, it would of course be natural to assume that the entries of y|X follow a Bernoulli distribution, and under this assumption it is not necessarily true that the event y ∈ N has zero probability. A similar quandary holds for the Poisson case. In short, it does not seem straightforward to bound the probability that y ∈ N in cases of fundamental interest, e.g., when the entries of y|X follow a Bernoulli or Poisson model and G is the associated negative log-likelihood, but intuitively y ∈ N seems "unlikely" in these cases. A careful analysis is left to future work.

Discussion
In this paper, we derived sufficient conditions for the generalized lasso problem (1) to have a unique solution, which allow for p > n (in fact, allow for p to be arbitrarily larger than n): as long as the predictors and response jointly follow a continuous distribution, and the null space of the penalty matrix has dimension at most n, our main result in Theorem 1 shows that the solution is unique. We have also extended our study to the problem (18), where the loss is of generalized linear model form (19), and established an analogous (and more general) uniqueness result in Theorem 2. Along the way, we have also shown some new results on the local stability of boundary sets and active sets, in Lemma 16, and on the invariance of key linear subspaces, in Lemma 17, in the generalized linear model case, which may be of interest in their own right.
An interesting direction for future work is to carefully bound the probability that y ∈ N , where N is as in (37), in some typical generalized linear models like the Bernoulli and Poisson cases. This would give us a more concrete probabilistic statement about uniqueness in such cases, following from Theorem 2. Another interesting direction is to inspect the application of Theorems 1 and 2 to additive trend filtering and varying-coefficient models. Lastly, the local stability result in Lemma 16 seems to suggest that a nice expression for the divergence of the fit (34), as a function of y, may be possible (furthermore, Lemma 17 suggests that this expression should be invariant to the choice of boundary set). This may prove useful for various purposes, e.g., for constructing unbiased risk estimates in penalized generalized linear models.

A.1 Proof of Lemma 5
As the generalized lasso solution is not unique, we know that condition (10) cannot hold, and there exist B, s associated with an optimal subgradient in problem (1) for which rank(XU (B)) < k(B), for any U (B) ∈ R p×k(B) whose linearly independent columns span null(D −B ). Thus, fix an arbitrary choice of basis matrix U (B). Then by construction we have that Z i = XU i (B) ∈ R n , i = 1, . . . , k(B) are linearly dependent.
Note that multiplying both sides of the KKT conditions (2) by U (B) T gives by definition ofs. We will first show that the assumptions in the lemma,s = 0. To see this, ifs = 0, then at any solutionβ as in (9) There are two cases to consider. Ifs ij = 0 for all j = 2, . . . , k, then we must have c 1 = 0, so from (40), If insteads ij = 0 for some j = 2, . . . , k, then define J = {j ∈ {1, . . . , k} :s ij = 0} (which we know in the present case has cardinality |J | ≥ 2). Rewrite (41) as and hence rewrite (40) as Reflecting on the two conclusions (42), (43) from the two cases considered, we can reexpress these as (12), (13), respectively, completing the proof.

A.2 Proof of Lemma 7
Fix an arbitrary B ⊆ {1, . . . , m} and s ∈ {−1, 1} |B| . Define U (B) ∈ R p×k(B) whose columns form a basis for null(D −B ) by running Gauss-Jordan elimination on D −B . We may assume without a loss of generality that this is of the form where I ∈ R k(B)×k(B) is the identity matrix and F ∈ R (p−k(B))×k(B) is a generic dense matrix. (If need be, then we can always permute the columns of X, i.e., relabel the predictor variables, in order to obtain such a form.) This allows us to express the columns of Z = XU (B) as Importantly, for each i = 1, . . . , k(B), we see that only Z i depends on X i (i.e., no other Z j , j = i depends on X i ). Select any i 1 , . . . , i k ∈ {1, . . . , k(B)} withs i1 = 0 and k ≤ n + 1. Suppose first that s i2 = · · · =s i k = 0. Then Conditioning on X j , j = i 2 , the right-hand side above is just some fixed affine space of dimension at most n − 1, and so owing to the fact that X i2 | X j , j = i 2 has a continuous distribution over R n . Integrating out over X j , j = i 2 then gives which proves a violation of case (i) in the definition of D-GP happens with probability zero. Similar arguments show that a violation of case (ii) in the definition of D-GP happens with probability zero. Taking a union bound over all possible B, s, i 1 , . . . , i k , and k shows that any violation of the defining properties of the D-GP condition happens with probability zero, completing the proof.

A.3 Proof of Lemma 8
Checking that null(X) ∩ null(D) = {0} is equivalent to checking that the stacked matrix $M = \begin{bmatrix} X \\ D \end{bmatrix}$ has linearly independent columns. In the case p ≤ n, the columns of X will be linearly independent almost surely (the argument for this is similar to the arguments in the proof of Lemma 7), so the columns of M will be linearly independent almost surely. Thus assume p > n. Let q = nullity(D), so r = rank(D) = p − q. Pick r columns of D that are linearly independent; then the corresponding columns of M are linearly independent. It now suffices to check linear independence of the remaining p − r columns of M . But any n columns of X will be linearly independent almost surely (again, the argument for this is similar to the arguments from the proof of Lemma 7), so the result is given provided p − r ≤ n, i.e., q ≤ n.
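The rank criterion in this proof is easy to probe numerically. The sketch below is an illustration under assumed inputs (Gaussian X with p > n, and a first-difference penalty matrix, whose null space is spanned by the all-ones vector): it checks that stacking X on top of D gives full column rank when nullity(D) ≤ n, and shows the condition failing when the null space of the penalty matrix is too large.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 4, 10                                   # high-dimensional: p > n
X = rng.standard_normal((n, p))                # continuously distributed predictors

# First-difference matrix, (p-1) x p; its null space is span{(1, ..., 1)}.
D = np.diff(np.eye(p), axis=0)
nullity_D = p - np.linalg.matrix_rank(D)

# null(X) and null(D) intersect trivially iff the stacked matrix has rank p.
stacked = np.vstack([X, D])
trivial_intersection = np.linalg.matrix_rank(stacked) == p

# Contrast: a zero penalty matrix has nullity p > n, and the stack is rank deficient.
D_zero = np.zeros((1, p))
rank_with_zero_penalty = np.linalg.matrix_rank(np.vstack([X, D_zero]))

print(nullity_D, trivial_intersection, rank_with_zero_penalty)
```

A single random draw is of course not a proof; the sketch only shows what the almost sure statement looks like on one realization.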

A.4 Proof of Lemma 9
Let $\sigma^{n-1}$ denote the $(n-1)$-dimensional spherical measure, which is just a normalized version of the $(n-1)$-dimensional Hausdorff measure $\mathcal{H}^{n-1}$ on the unit sphere $S^{n-1}$, i.e., defined by
$$\sigma^{n-1}(S) = \frac{\mathcal{H}^{n-1}(S)}{\mathcal{H}^{n-1}(S^{n-1})}, \quad \text{for } S \subseteq S^{n-1}. \quad (44)$$
Thus, it is sufficient to prove that the distribution of $Z/\|Z\|_2$ is absolutely continuous with respect to $\sigma^{n-1}$. For this, it is helpful to recall that an alternative definition of the $(n-1)$-dimensional spherical measure, for an arbitrary $\alpha > 0$, is
$$\sigma^{n-1}(S) = \frac{\mathcal{L}^n(\mathrm{cone}_\alpha(S))}{\mathcal{L}^n(B^n_\alpha)}, \quad \text{for } S \subseteq S^{n-1}, \quad (45)$$
where $\mathcal{L}^n$ denotes $n$-dimensional Lebesgue measure, $B^n_\alpha = \{x \in \mathbb{R}^n : \|x\|_2 \leq \alpha\}$ is the $n$-dimensional ball of radius $\alpha$, and $\mathrm{cone}_\alpha(S) = \{tx : x \in S, t \in [0, \alpha]\}$. That (45) and (44) coincide is due to the fact that any two measures that are uniformly distributed over a separable metric space must be equal up to a positive constant (see Theorem 3.4 in Mattila 1995), and as both (45) and (44) are probability measures on $S^{n-1}$, this positive constant must be 1. Now let $S \subseteq S^{n-1}$ be a set of null spherical measure, $\sigma^{n-1}(S) = 0$. From the representation for spherical measure in (45), we see that $\mathcal{L}^n(\mathrm{cone}_\alpha(S)) = 0$ for any $\alpha > 0$. Denoting $\mathrm{cone}(S) = \{tx : x \in S, t \geq 0\}$, we have
$$\mathcal{L}^n(\mathrm{cone}(S)) = \mathcal{L}^n\Big(\bigcup_{k=1}^\infty \mathrm{cone}_k(S)\Big) \leq \sum_{k=1}^\infty \mathcal{L}^n(\mathrm{cone}_k(S)) = 0.$$
This means that $P(Z \in \mathrm{cone}(S)) = 0$, as the distribution of $Z$ is absolutely continuous with respect to $\mathcal{L}^n$, and moreover $P(Z/\|Z\|_2 \in S) = 0$, since $Z \in \mathrm{cone}(S) \iff Z/\|Z\|_2 \in S$. This completes the proof.

A.5 Proof of Lemma 10
Denote the n-dimensional unit ball by B n = {x ∈ R n : x 2 ≤ 1}. Note that the relative boundary of B n ∩ A is precisely relbd(B n ∩ A) = S n−1 ∩ A.
As the boundary of a convex set has Lebesgue measure zero (see Theorem 1 in Lang 1986), we claim this implies S n−1 ∩ A has (n − 1)-dimensional Hausdorff measure zero. To see this, note first that we can assume without a loss of generality that dim(A) = n − 1, else the claim follows immediately. We can now interpret B n ∩ A as a set in the ambient space A, which is diffeomorphic-via a change of basis-to R n−1 .
To be more precise, if $V \in \mathbb{R}^{n \times (n-1)}$ is a matrix whose columns are orthonormal and span the linear part of $A$, and $a \in A$ is arbitrary, then $V^T(B^n \cap A - a) \subseteq \mathbb{R}^{n-1}$ is a convex set, and by the fact cited above its boundary must have $(n-1)$-dimensional Lebesgue measure zero. It can be directly checked that
$$\mathrm{bd}\big(V^T(B^n \cap A - a)\big) = V^T(S^{n-1} \cap A - a).$$
As the $(n-1)$-dimensional Lebesgue measure and $(n-1)$-dimensional Hausdorff measure coincide on $\mathbb{R}^{n-1}$, we see that $V^T(S^{n-1} \cap A - a)$ has $(n-1)$-dimensional Hausdorff measure zero. Lifting this set back to $\mathbb{R}^n$, via the transformation
$$S^{n-1} \cap A = V\big(V^T(S^{n-1} \cap A - a)\big) + a,$$
we see that $S^{n-1} \cap A$ too must have Hausdorff measure zero, the desired result, because the map $x \mapsto Vx + a$ is Lipschitz (then apply, e.g., Theorem 1 in Section 2.4.1 of Evans and Gariepy 1992).

A.6 Proof of Lemma 11
Let us abbreviate $\tilde X = XW_X^{-1}$ for the scaled predictor matrix, whose columns are $\tilde X_i = X_i/\|X_i\|_2$, $i = 1, \ldots, p$. By similar arguments to those given in the proof of Lemma 7, to show $\tilde X$ is in D-GP almost surely, it suffices to show that for each $i = 1, \ldots, p$,
$$P\big(\tilde X_i \in A \,\big|\, \tilde X_j, \, j \neq i\big) = 0,$$
where $A \subseteq \mathbb{R}^n$ is an affine space depending on $\tilde X_j$, $j \neq i$. This follows by applying our previous two lemmas: the distribution of $\tilde X_i$ is absolutely continuous with respect to $(n-1)$-dimensional Hausdorff measure on $S^{n-1}$, by Lemma 9, and $S^{n-1} \cap A$ has $(n-1)$-dimensional Hausdorff measure zero, by Lemma 10.
To establish that the null space condition null(X) ∩ null(D) = {0} holds almost surely, note that the proof of Lemma 8 really only depends on the fact that any collection of k columns of X, for k ≤ n, are linearly independent almost surely. It can be directly checked that the scaled columns of X share this same property, and thus we can repeat the same arguments as in Lemma 8 to give the result.
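
A.7 Proof of Lemma 14
Writing $h(\beta) = \lambda\|D\beta\|_1$, we can rewrite problem (18) as
$$\underset{\beta \in \mathbb{R}^p,\, z \in \mathbb{R}^n}{\text{minimize}} \;\; G(z) + h(\beta) \quad \text{subject to} \quad z = X\beta. \quad (46)$$
The Lagrangian of the above problem is
$$L(\beta, z, v) = G(z) + h(\beta) + v^T(z - X\beta), \quad (47)$$
and minimizing the Lagrangian over $\beta, z$ gives the dual problem
$$\underset{v \in \mathbb{R}^n}{\text{maximize}} \;\; -G^*(-v) - h^*(X^T v), \quad (48)$$
where $G^*$ is the conjugate of $G$, and $h^*$ is the conjugate of $h$.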
and hence the dual problem (48) is equivalent to the claimed one (22). As G is essentially smooth and essentially strictly convex, the interior of its domain is nonempty. Since the domain of h is all of R p , this is enough to ensure that strong duality holds between (46) and (48). Further, if a solutionβ,ẑ is attained in (46), and a solutionv is attained in (48), then by minimizing the Lagrangian L(β, z,v) in (47) over β and z, we have the relationships respectively, where ∂h(·) is the subdifferential operator of h. The first relationship in (49) can be rewritten as ∇G(Xβ) = −v, matching that in (23). The second relationship in (49) can be rewritten as D Tû ∈ ∂h(β), whereû ∈ D T B m ∞ (λ) is such that X Tv = D Tû , which is equivalent toû/λ being a subgradient of the 1 norm evaluated at Dβ, matching the second relationship in (23).
Lastly, we address the constraint qualification conditions (24), (25). When (24) holds, we know that G * has no directions of recession, so the dual problem (22) has a solution, (see, e.g., Theorems 27.1 and 27.3 in Rockafellar 1970), equivalently, problem (48) has a solution. When (25) also holds, we have the further guarantee that −v ∈ int(dom(G * )) by essential smoothness of G * (in particular, by the property that ∇G * 2 approaches ∞ along any sequence that converges to a boundary point of the domain; see, e.g., Theorem 3.12 in Bauschke and Borwein 1997). Recalling that the mapping ∇G : int(dom(G)) → int(dom(G * )) is a homeomorphism, we see that int(dom(G * )) = int(ran(∇G)), and so the Lagrangian L(β, z,v) in (47) attains its infimum over z at the pointẑ = ∇G * (−v). That the Lagrangian L(β, z,v) attains its infimum over β is due to the fact that map β → h(β) −v T Xβ has no strict directions of recession (directions of recession in which this map is not constant). This ensures that problem (46) has a solution, equivalently, that problem (18) has a solution, completing the proof.

A.8 Proof of Lemma 15
We first establish (34), (35). Multiplying both sides of stationarity condition (29) by P null(D −B ) yields We will now show that (50), (51) together imply ∇ψ(Xβ) can be expressed in terms of a certain Bregman projection onto an affine subspace, with respect to ψ * . To this end, consider for a function f , point a, and set S. The first-order optimality conditions are When S is an affine subspace, i.e., S = c + L for a point c and linear subspace L, this reduces to i.e., P L ∇f (x) = P L ∇f (a), and P L ⊥x = P L ⊥ c.

A.9 Proof of Lemma 16
The proof follows a similar general strategy to that of Lemma 9 in Tibshirani and Taylor (2012). We will abbreviate B = B(y), s = s(y), A = A(y), and r = r(y). Consider the representation forβ(y) in (35) of Lemma 15. As the active set is A, we know that i.e., and so P [D B\A (null(X)∩null(D −B ))] ⊥ D B\A (XP null(D −B ) ) + ∇ψ * P ψ * y−K B,s ∇ψ(0) = 0.
Since ∇ψ * (x) = Xβ(y), we have ∇ψ * (x) ∈ col(XP null(D −B ) ), so combining this with above display, and using (∇ψ * ) −1 = ∇ψ, giveŝ And sincex ∈ y − K B,s , with K B,s an affine space, as defined in (33), we have y ∈x + K B,s , which combined with the last display implies But as y / ∈ N , where the set N is defined in (37), we arrive at This is an important realization that we will return to shortly. As for the optimal subgradientγ(y) corresponding toβ(y), note that we can writê for some c ∈ null(D T −B ). The first expression holds by definition of B, s, and the second is a result of solving forγ −B (y) in the stationarity condition (29), after plugging in for the form of the fit in (34). Now, at a new response y , consider defininĝ β(y ) = (XP null(D −B ) ) + ∇ψ * P ψ * y −K B,s ∇ψ(0) + b , γ B (y ) = λs, for some b ∈ null(X) ∩ null(D −B ) to be specified later, and for the same value of c ∈ null(D T −B ) as in (54). By the same arguments as given at the end of the proof of Lemma 14, where we discussed the constraint qualification conditions (24), (25), the Bregman projection P ψ * y −K B,s (∇ψ(0)) in the above expressions is well-defined, for any y , under (31). However, this Bregman projection need not lie in int(dom(ψ * ))-and therefore ∇ψ * (P ψ * y −K B,s (∇ψ(0))) need not be well-defined-unless we have the additional condition y ∈ int(ran(∇ψ)) + C. Fortunately, under (32), the latter condition on y is implied as long as y is sufficiently close to y, i.e., there exists a neighborhood U 0 of y such that y ∈ int(ran(∇ψ)) + C, provided y ∈ U 0 . By Lemma 14, we see that a solution in (18) exists at such a point y . In what remains, we will show that this solution and its optimal subgradient obey the form in the above display.
Note that, by construction, the pair (β̂(y′), γ̂(y′)) defined above satisfies the stationarity condition (29) at y′, and γ̂(y′) has boundary set and boundary signs B, s. It remains to show that (β̂(y′), γ̂(y′)) satisfies the subgradient condition (21), and that β̂(y′) has active set and active signs A, r; equivalently, it remains to verify the following three properties, for y′ sufficiently close to y, and for an appropriate choice of b′: (i) ‖γ̂ −B (y′)‖ ∞ < 1; (ii) supp(Dβ̂(y′)) = A; (iii) sign(D A β̂(y′)) = r.
Becauseγ(y) is a subgradient corresponding toβ(y), and has boundary set and boundary signs B, s, we know thatγ −B (y) in (54) has ∞ norm strictly less than 1. Thus, by continuity of at y, which is implied by continuity of x → P ψ * x−K B,s (∇ψ(0)) at y, by Lemma 18, we know that there exists some neighborhood U 1 of y such that property (i) holds, provided y ∈ U 1 .
By the important fact established in (53), we see that there exists $b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$ such that
$$D_{B \setminus A}\, b' = -D_{B \setminus A} (X P_{\mathrm{null}(D_{-B})})^+ \nabla\psi^*\big(P^{\psi^*}_{y'-K_{B,s}}(\nabla\psi(0))\big),$$
which implies that $D_{B \setminus A}\hat\beta(y') = 0$. To verify properties (ii) and (iii), we must show this choice of $b'$ is such that $D_A\hat\beta(y')$ is nonzero in every coordinate and has signs matching $r$. Define a map
$$T(x) = (X P_{\mathrm{null}(D_{-B})})^+ \nabla\psi^*\big(P^{\psi^*}_{x-K_{B,s}}(\nabla\psi(0))\big),$$
which is continuous at $y$, again by continuity of $x \mapsto P^{\psi^*}_{x-K_{B,s}}(\nabla\psi(0))$ at $y$, by Lemma 18. Observe that
$$D_A\hat\beta(y') = D_A T(y') + D_A b'.$$
As $D_A\hat\beta(y) = D_A T(y) + D_A b$ is nonzero in every coordinate and has signs equal to $r$, by definition of $A, r$, and $T$ is continuous at $y$, there exists a neighborhood $U_2$ of $y$ such that $D_A T(y') + D_A b$ is nonzero in each coordinate with signs matching $r$, provided $y' \in U_2$. Furthermore, as
$$\|D_A b' - D_A b\|_\infty \le \|D^T\|_{2,\infty} \|b' - b\|_2,$$
where $\|D^T\|_{2,\infty}$ denotes the maximum $\ell_2$ norm of rows of $D$, we see that $D_A T(y') + D_A b'$ will be nonzero in each coordinate with the correct signs, provided $b'$ can be chosen arbitrarily close to $b$, subject to the restrictions $b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$ and $D_{B \setminus A}\, b' = -D_{B \setminus A} T(y')$.
Such a $b'$ does indeed exist, by the bounded inverse theorem. Let $L = \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$, and $N = \mathrm{null}(D_{B \setminus A}) \cap L$. Consider the linear map $D_{B \setminus A}$, viewed as a function from $L/N$ (the quotient of $L$ by $N$) to $D_{B \setminus A}(L)$: this is a bijection, and therefore it has a bounded inverse. This means that there exists some $R > 0$ such that
$$\|b' - b\|_2 \le R\, \|D_{B \setminus A} T(y') - D_{B \setminus A} T(y)\|_2,$$
for a choice of $b' \in \mathrm{null}(X) \cap \mathrm{null}(D_{-B})$ with $D_{B \setminus A}\, b' = -D_{B \setminus A} T(y')$. By continuity of $T$ at $y$, once again, there exists a neighborhood $U_3$ of $y$ such that the right-hand side above is sufficiently small, i.e., such that $\|b' - b\|_2$ is sufficiently small, provided $y' \in U_3$.
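In finite dimensions, the constant $R$ can be written down explicitly. The following is only a sketch of one way to do so, not part of the original argument; the symbols $M$, $P_L$, $\delta$, and $\sigma^+_{\min}$ are notation introduced here for illustration, and the sketch assumes $M P_L \neq 0$ (otherwise $\delta = 0$ and $b' = b$ works).

```latex
% Notation introduced for this sketch (not from the paper):
%   M = D_{B \setminus A},  P_L = orthogonal projector onto L = null(X) \cap null(D_{-B}),
%   \sigma^+_{\min}(\cdot) = smallest nonzero singular value.
% Both -M T(y') and -M T(y) lie in M(L), so their difference does too:
\[
\delta = -M T(y') + M T(y) \in M(L), \qquad
b' = b + (M P_L)^+ \delta .
\]
% b' stays in L, since (M P_L)^+ maps into row(M P_L) = col(P_L M^T) \subseteq L.
% The constraint holds, since M P_L (M P_L)^+ is the orthogonal projector
% onto col(M P_L) = M(L), which contains \delta:
\[
M b' = M b + M P_L (M P_L)^+ \delta = -M T(y) + \delta = -M T(y') .
\]
% The norm bound follows from the operator norm of the pseudoinverse:
\[
\|b' - b\|_2 \le \|(M P_L)^+\|_{\mathrm{op}} \, \|\delta\|_2
= \frac{\|M T(y') - M T(y)\|_2}{\sigma^+_{\min}(M P_L)},
\qquad \text{so one may take } R = 1/\sigma^+_{\min}(M P_L) .
\]
```

Under these assumptions, $R$ depends only on $X$, $D$, $B$, and $A$, and not on $y'$, which is what the continuity argument requires.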
Finally, letting $U = U_0 \cap U_1 \cap U_2 \cap U_3$, we see that we have established properties (i), (ii), and (iii), and hence the desired result, provided $y' \in U$.
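The local stability just established can be illustrated numerically in the simplest special case. The sketch below assumes squared loss and $D = I$ (the ordinary lasso), where the active set and signs are the support and signs of the solution. It uses a plain cyclic coordinate descent solver, which is a choice made here for illustration and not a method from the paper; the function names and the toy data ($p = 3$ predictors, $n = 2$ samples) are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=500):
    """Cyclic coordinate descent for (1/2)||y - X b||_2^2 + lam * ||b||_1.

    X is a list of rows; this is an illustrative solver, not the paper's method.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # inner product of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def active_signs(beta):
    """Signs of the coefficients; 0 marks an inactive coordinate."""
    return [(b > 0) - (b < 0) for b in beta]


# Toy data (hypothetical): more predictors than samples.
X = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
lam = 0.5

signs_at_y = active_signs(lasso_cd(X, [2.0, 0.1], lam))
signs_nearby = active_signs(lasso_cd(X, [2.01, 0.09], lam))  # small perturbation of y
signs_far = active_signs(lasso_cd(X, [2.0, 1.0], lam))       # large perturbation of y
print(signs_at_y, signs_nearby, signs_far)
```

In this example the small perturbation leaves the active set and signs unchanged, while the large one moves $y$ outside the neighborhood and a different active set appears. The result is local: it says nothing about how large the neighborhood $U$ is.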

A.10 Continuity result for Bregman projections
Lemma 18. Let $f, f^*$ be a conjugate pair of Legendre (essentially smooth and essentially strictly convex) functions on $\mathbb{R}^n$, with $0 \in \mathrm{int}(\mathrm{dom}(f^*))$. Let $S \subseteq \mathbb{R}^n$ be a nonempty closed convex set. Then the Bregman projection map $x \mapsto P^f_{x-S}(\nabla f^*(0))$ is continuous on all of $\mathbb{R}^n$. Moreover, $P^f_{x-S}(\nabla f^*(0)) \in \mathrm{int}(\mathrm{dom}(f))$ for any $x \in \mathrm{int}(\mathrm{dom}(f)) + S$.
Proof. As $0 \in \mathrm{int}(\mathrm{dom}(f^*))$, we know that $f$ has no directions of recession (e.g., by Theorems 27.1 and 27.3 in Rockafellar 1970), hence the Bregman projection $P^f_{x-S}(\nabla f^*(0))$ is well-defined for any $x \in \mathbb{R}^n$. Further, for $x \in \mathrm{int}(\mathrm{dom}(f)) + S$, we know that $P^f_{x-S}(\nabla f^*(0)) \in \mathrm{int}(\mathrm{dom}(f))$, by essential smoothness of $f$ (by the property that $\|\nabla f\|_2$ approaches $\infty$ along any sequence that converges to a boundary point of $\mathrm{dom}(f)$; e.g., see Theorem 3.12 in Bauschke and Borwein 1997).
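It remains to verify continuity of $x \mapsto P^f_{x-S}(\nabla f^*(0))$. Write $P^f_{x-S}(\nabla f^*(0)) = \hat v$, where $\hat v$ is the unique solution of
$$\mathop{\mathrm{minimize}}_{v \in x - S} \; f(v)$$
(the Bregman divergence from $\nabla f^*(0)$ equals $f(v)$ up to an additive constant, since $\nabla f(\nabla f^*(0)) = 0$), or equivalently, $P^f_{x-S}(\nabla f^*(0)) = \hat w + x$, where $\hat w$ is the unique solution of
$$\mathop{\mathrm{minimize}}_{w \in -S} \; f(w + x).$$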
It suffices to show continuity of the unique solution in the above problem, as a function of $x$. This can be established using results from variational analysis, provided some conditions are met on the bi-criterion function $f_0(w, x) = f(w + x)$. In particular, Corollary 7.43 in Rockafellar and Wets (2009) implies that the unique minimizer in the above problem is continuous in $x$, provided $f_0$ is a closed proper convex function that is level-bounded in $w$ locally uniformly in $x$. By assumption, $f$ is a closed proper convex function (it is Legendre), and thus so is $f_0$. The level-boundedness condition can be checked as follows. Fix any $\alpha \in \mathbb{R}$ and $x \in \mathbb{R}^n$. The $\alpha$-level set $\{w : f(w + x) \le \alpha\}$ is bounded since $w \mapsto f(w + x)$ has no directions of recession (to see that this implies boundedness of all level sets, e.g., combine Theorem 27.1 and Corollary 8.7.1 of Rockafellar 1970). Meanwhile, for any $x' \in \mathbb{R}^n$,
$$\{w : f(w + x') \le \alpha\} = \{w : f(w + x) \le \alpha\} + x - x'.$$
Hence, the $\alpha$-level set of $f_0(\cdot, x')$ is uniformly bounded for all $x'$ in a neighborhood of $x$, as desired. This completes the proof.
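As a sanity check on Lemma 18, it may help to work out the Euclidean special case $f = \frac{1}{2}\|\cdot\|_2^2$. This choice of $f$ is an assumption made here purely for illustration, and $\Pi_S$ (Euclidean projection onto $S$) is notation introduced for this example.

```latex
% Euclidean case: f = f^* = (1/2)||.||_2^2, so \nabla f^*(0) = 0 and
% 0 \in int(dom(f^*)) = R^n. The Bregman divergence from 0 is (1/2)||v||_2^2.
\[
P^f_{x-S}(\nabla f^*(0))
= \operatorname*{argmin}_{v \in x - S} \tfrac{1}{2} \|v\|_2^2
= x - \Pi_S(x) .
\]
% For a nonempty closed convex set S, the map I - \Pi_S is nonexpansive:
\[
\big\| (x - \Pi_S(x)) - (x' - \Pi_S(x')) \big\|_2 \le \|x - x'\|_2
\quad \text{for all } x, x' \in \mathbb{R}^n .
\]
```

In this case the projection map is 1-Lipschitz, which is stronger than the continuity the lemma asserts. For a general Legendre $f$, no such Lipschitz bound is claimed.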

A.11 Proof of Lemma 17
The proof is similar to that of Lemma 10 in Tibshirani and Taylor (2012). Let $B, s$ be the boundary set and signs of an arbitrary optimal subgradient $\hat\gamma(y)$ in (18), and let $A, r$ be the active set and active signs of an arbitrary solution $\hat\beta(y)$ in (18). (Note that $\hat\gamma(y)$ need not correspond to $\hat\beta(y)$; it may be a subgradient corresponding to another solution in (18).) By (two applications of) Lemma 16, there exist neighborhoods $U_1, U_2$ of $y$ such that, over $U_1$, optimal subgradients exist with boundary set and boundary signs $B, s$, and over $U_2$, solutions exist with active set and active signs $A, r$. For any $y' \in U = U_1 \cap U_2$, by Lemma 15 and the uniqueness of the fit from Lemma 12, we have
$$X\hat\beta(y') = \nabla\psi^*\big(P^{\psi^*}_{y'-K_{B,s}}(\nabla\psi(0))\big) = \nabla\psi^*\big(P^{\psi^*}_{y'-K_{A,r}}(\nabla\psi(0))\big),$$
and as $\nabla\psi^*$ is a homeomorphism,
$$P^{\psi^*}_{y'-K_{B,s}}(\nabla\psi(0)) = P^{\psi^*}_{y'-K_{A,r}}(\nabla\psi(0)). \tag{55}$$
We claim that this implies $\mathrm{null}(P_{\mathrm{null}(D_{-B})} X^T) = \mathrm{null}(P_{\mathrm{null}(D_{-A})} X^T)$. To see this, take any direction $z \in \mathrm{null}(P_{\mathrm{null}(D_{-B})} X^T)$, and let $\epsilon > 0$ be sufficiently small so that $y' = y + \epsilon z \in U$. From (55), we have
$$P^{\psi^*}_{y'-K_{A,r}}(\nabla\psi(0)) = P^{\psi^*}_{y'-K_{B,s}}(\nabla\psi(0)) = P^{\psi^*}_{y-K_{B,s}}(\nabla\psi(0)) = P^{\psi^*}_{y-K_{A,r}}(\nabla\psi(0)),$$
where the second equality used $y' - K_{B,s} = y - K_{B,s}$, and the third used the fact that (55) indeed holds at $y$. Now consider the left-most and right-most expressions above. For these two projections to match, we must have $z \in \mathrm{null}(P_{\mathrm{null}(D_{-A})} X^T)$; otherwise, the affine subspaces $y' - K_{A,r}$ and $y - K_{A,r}$ would be parallel, in which case clearly the projections cannot coincide. Hence, we have shown that $\mathrm{null}(P_{\mathrm{null}(D_{-B})} X^T) \subseteq \mathrm{null}(P_{\mathrm{null}(D_{-A})} X^T)$. The reverse inclusion follows similarly, establishing the desired claim. Lastly, as $B, A$ were arbitrary, the linear subspace $L = \mathrm{null}(P_{\mathrm{null}(D_{-B})} X^T) = \mathrm{null}(P_{\mathrm{null}(D_{-A})} X^T)$ must be unchanged for any choice of boundary set $B$ and active set $A$ at $y$, completing the proof.