On model selection consistency of regularized M-estimators

Regularized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure. Usually the low-dimensional structure is encoded by the assumption that the (unknown) parameter vector lies in some low-dimensional model subspace. In such settings, it is desirable for estimates of the model parameters to be \emph{model selection consistent}: the estimates should also fall in the model subspace. We develop a general framework for establishing consistency and model selection consistency of regularized M-estimators and show how it applies to some special cases of interest in statistical learning. Our analysis identifies two key properties of regularized M-estimators, referred to as geometric decomposability and irrepresentability, that ensure the estimators are consistent and model selection consistent.


Introduction
The principle of parsimony is used in many areas of science and engineering to promote "simple" models over more complex ones. In machine learning, signal processing, and high-dimensional statistics, this principle motivates the use of sparsity-inducing penalties for variable/feature selection and signal recovery from incomplete measurements. In this work, we consider M-estimators of the form

minimize_{θ ∈ R^p}  ℓ^(n)(θ) + λρ(θ),   (1.1)

where ℓ^(n) is a convex, twice continuously differentiable loss function and ρ is a penalty function. Many commonly used penalties are geometrically decomposable, i.e. can be expressed as a sum of support functions. We generalize the notion of irrepresentability and develop a general framework to establish consistency and model selection consistency of these penalized M-estimators. When specialized to various statistical models, our framework yields some known and some new model selection consistency results.
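To make (1.1) concrete, here is a minimal sketch (in Python, assuming NumPy) of a proximal gradient solver for the special case of a squared-error loss with the lasso penalty; the function names, step-size rule, and iteration count are illustrative choices, not part of the framework.

import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1 (the lasso penalty).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, n_iter=500):
    # Minimize (1/2n)||y - X theta||_2^2 + lam * ||theta||_1,
    # a special case of the penalized M-estimator (1.1).
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta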
The rest of the paper is organized as follows. First, we review existing work on consistency and model selection consistency of penalized M-estimators. In Section 2, we introduce geometrically decomposable penalties and give two examples from statistical learning. In Section 3, we generalize the notion of irrepresentability and state our main result (Theorem 3.4). We prove our main result in Section 4, and in Section 5 we use it to derive consistency and model selection consistency results for two statistical models and verify the consequences of these results empirically. Finally, in Section 6, we develop a converse result concerning the necessity of the irrepresentable condition.

Consistency and model selection consistency of penalized M-estimators
The consistency of penalized M-estimators has been studied extensively. The three most well-studied problems are (i) the lasso [6,3,31], (ii) generalized linear models (GLMs) with the lasso penalty [13], and (iii) penalized inverse covariance estimators (equivalent to penalized maximum likelihood for a Gaussian Markov random field) [2,26,15,25]. There are also many results for M-estimators with group and structured variants of the lasso penalty [1,11,18,10]. Negahban et al. [22] proposed a unified framework for establishing consistency and convergence rates for M-estimators with penalties ρ that are decomposable with respect to a pair of subspaces (M, M̄): ρ(x + y) = ρ(x) + ρ(y) for all x ∈ M, y ∈ M̄^⊥.
Many commonly used penalties such as the lasso, group lasso, and nuclear norm are decomposable in this sense. Negahban et al. also develop a notion of restricted strong convexity for such penalties and prove a general result that establishes the consistency of M-estimators with these penalties. Using their framework, they derive consistency results for special cases like sparse and group sparse regression. We focus on model selection consistency of penalized M-estimators. Previous work in this area identified the notion of irrepresentability for the lasso [20,35,31], which was then generalized to GLMs with the lasso penalty [5,24,33]. These results were later extended to group and structured variants of the lasso penalty [34,21,14,27,12,23,29]. The irrepresentable condition has also been used to obtain model selection consistency results for estimating inverse covariance matrices with the lasso penalty [15,25]. These methods have been extended to fit discrete graphical models using penalized composite likelihood estimators [9] and generalized covariance matrices [19].
There is also a rich literature on constrained M-estimators of the form minimize_{θ ∈ R^p} ρ(θ) subject to A(θ) ∈ K, where A is an affine mapping and K is a convex cone. We do not review this literature except to describe a notion of decomposability proposed by Candès and Recht [7]. A penalty ρ is decomposable according to Definition 2.2 in [7] if there exists a subspace S such that ∂ρ(θ⋆) consists of the vectors z whose component in S is a fixed vector and whose component z_{S⊥} in S⊥ is suitably bounded, where θ⋆ is the unknown parameter vector. Many commonly used penalties such as the lasso, group lasso, and nuclear norm are also decomposable in this sense. We refer to the introduction in [7] for a review of the work on constrained M-estimators in compressed sensing and low-rank matrix completion.

Geometrically decomposable penalties
Let C be a closed convex set in R^p. The support function of C is h_C(x) = sup_{y ∈ C} y^T x. If C is a norm ball, i.e. C = {x | ‖x‖ ≤ 1}, then h_C is the dual norm: h_C(x) = ‖x‖_*. The support function is a supremum of linear functions, hence convex. The support function (as a function of the convex set C) is also additive over Minkowski sums, i.e. h_{C_1 + C_2} = h_{C_1} + h_{C_2}.
We use this property to express penalty functions as sums of support functions, e.g. if ρ is a norm and the dual norm ball can be expressed as a (Minkowski) sum of convex sets, then ρ can be expressed as a sum of support functions. If a penalty function ρ can be expressed as

ρ(θ) = h_A(θ) + h_I(θ) + h_{S⊥}(θ),   (2.1)

where A and I are closed convex sets and S is a subspace, then we say ρ is a geometrically decomposable penalty function. This form is general: if ρ can be expressed as a sum of support functions, i.e. ρ(θ) = h_{C_1}(θ) + ⋯ + h_{C_k}(θ),
then we can set A and I to be sums of the sets C_1, . . . , C_k to express ρ in geometrically decomposable form (2.1). In many cases of interest, A + I is a norm ball and h_{A+I} = h_A + h_I is the dual norm. In our analysis, we assume A and I are bounded.
We do not require A+I to contain a neighborhood of the origin. This generality allows for unpenalized variables.
The notation A and I should be read as "active" and "inactive": span(A) should contain the true parameter vector and span(I) should contain deviations from the truth that we want to penalize. E.g. if we know the sparsity pattern of the unknown parameter vector, then A should span the subspace of all vectors with the correct sparsity pattern.
The third term in (2.1) enforces a subspace constraint θ ∈ S, because the support function of a subspace is the characteristic function of its orthogonal complement: h_{S⊥}(θ) = 0 if θ ∈ S and h_{S⊥}(θ) = ∞ otherwise. Such subspace constraints arise in many problems, either naturally or after reformulation. We give two examples of M-estimators with geometrically decomposable penalty functions from statistical learning.

The lasso and group lasso penalties
Two geometrically decomposable penalties are the lasso and group lasso penalties. Let A and I be complementary subsets of {1, . . . , p}. We can decompose the lasso penalty component-wise to obtain

‖θ‖_1 = h_{B∞,A}(θ) + h_{B∞,I}(θ),

where h_{B∞,A} and h_{B∞,I} are support functions of the sets

B∞,A = {θ ∈ R^p | ‖θ‖_∞ ≤ 1 and θ_i = 0, i ∈ I},
B∞,I = {θ ∈ R^p | ‖θ‖_∞ ≤ 1 and θ_i = 0, i ∈ A}.

We can also decompose the group lasso penalty group-wise (A and I are now complementary sets of groups) to obtain

Σ_{g∈G} ‖θ_g‖_2 = h_{B(2,∞),A}(θ) + h_{B(2,∞),I}(θ),

where h_{B(2,∞),A} and h_{B(2,∞),I} are support functions of the sets

B(2,∞),A = {θ ∈ R^p | max_{g∈G} ‖θ_g‖_2 ≤ 1 and θ_g = 0, g ∈ I},
B(2,∞),I = {θ ∈ R^p | max_{g∈G} ‖θ_g‖_2 ≤ 1 and θ_g = 0, g ∈ A}.
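As a quick numerical sanity check of the component-wise decomposition above, the following sketch (assuming NumPy and an arbitrary split of the coordinates into active and inactive sets) verifies that the lasso penalty equals the sum of the two support functions; the helper name is ours.

import numpy as np

def h_box_restricted(theta, support):
    # Support function of {z : ||z||_inf <= 1 and z_j = 0 for j outside `support`},
    # which equals the l1 norm of theta restricted to `support`.
    return np.abs(theta[support]).sum()

rng = np.random.default_rng(0)
p = 10
theta = rng.normal(size=p)
active = np.arange(4)          # indices in A
inactive = np.arange(4, p)     # indices in I
lhs = np.abs(theta).sum()      # lasso penalty ||theta||_1
rhs = h_box_restricted(theta, active) + h_box_restricted(theta, inactive)
assert np.isclose(lhs, rhs)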

The generalized lasso penalty
Another geometrically decomposable penalty is the generalized lasso penalty [28]. Let D ∈ R^{m×p} and let A and I be complementary subsets of {1, . . . , m}. We can express the generalized lasso penalty in decomposable form,

‖Dθ‖_1 = h_{D^T B∞,A}(θ) + h_{D^T B∞,I}(θ),

where h_{D^T B∞,A} and h_{D^T B∞,I} are support functions of the sets

D^T B∞,A = {D^T u | ‖u‖_∞ ≤ 1 and u_i = 0, i ∈ I},
D^T B∞,I = {D^T u | ‖u‖_∞ ≤ 1 and u_i = 0, i ∈ A}.

We can also formulate any generalized lasso penalized M-estimator as a linearly constrained, lasso penalized M-estimator: after a change of variables, the penalty becomes a lasso penalty, which can be decomposed component-wise as above, and the subspace constraint θ ∈ N(D) can be enforced with the support function of R(D)^⊥. This yields an optimization problem over θ ∈ R^k and γ ∈ R^p. There are many interesting applications of the generalized lasso in signal processing and statistical learning. We refer to Section 2 in [28] for some examples.
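For illustration, one familiar instance of the generalized lasso is the fused lasso, where D is the first-difference matrix. The sketch below, which assumes the cvxpy package and synthetic data, simply hands a least-squares instance of the problem to a generic convex solver; it is not the estimator analyzed later.

import numpy as np
import cvxpy as cp

# Fused-lasso instance of the generalized lasso: D is the first-difference matrix.
n, p = 50, 20
rng = np.random.default_rng(1)
X = rng.normal(size=(n, p))
theta_star = np.repeat([0.0, 2.0, 0.0, -1.5], 5)   # piecewise-constant signal
y = X @ theta_star + 0.1 * rng.normal(size=n)

D = np.diff(np.eye(p), axis=0)    # (p-1) x p matrix of first differences
lam = 0.1
theta = cp.Variable(p)
objective = cp.Minimize(cp.sum_squares(y - X @ theta) / (2 * n) + lam * cp.norm1(D @ theta))
cp.Problem(objective).solve()
theta_hat = theta.value           # approximately piecewise constant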

Main result
We assume the unknown parameter vector θ⋆ is contained in the model subspace

M := span(I)^⊥ ∩ S.   (3.1)

We say an estimate θ̂ is consistent (in the ℓ2 norm) if the estimation error ‖θ̂ − θ⋆‖_2 converges to zero in probability as the sample size grows. We say θ̂ is model selection consistent if the estimator selects the correct model with probability tending to one as the sample size grows: Pr(θ̂ ∈ M) → 1 as n → ∞.
Before we state our main result, we state our assumptions on the problem. These assumptions are stated in terms of the sample Fisher information matrix Q^(n)(θ) := ∇²ℓ^(n)(θ). We use B_r(x) to denote the ball (in the ℓ2 norm) of radius r centered at x, i.e. B_r(x) := {y ∈ R^p | ‖y − x‖_2 ≤ r}.
Assumption 3.1 (Restricted strong convexity). We assume the loss function ℓ^(n) is locally strongly convex with constant m over the model subspace, i.e.

ℓ^(n)(θ_1) ≥ ℓ^(n)(θ_2) + ∇ℓ^(n)(θ_2)^T(θ_1 − θ_2) + (m/2)‖θ_1 − θ_2‖_2²

for some m > 0 and all θ_1, θ_2 ∈ B_r(θ⋆) ∩ M.
We require this assumption to make the maximum likelihood estimate unique over the model subspace. Otherwise, there is no hope for consistency. This assumption requires the loss function to be curved along certain directions in the model subspace and is very similar to the notion of restricted strong convexity in [22] and compatibility in [4]. Intuitively, this assumption means the "active" predictors are not overly dependent on each other.
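For a quadratic loss, Assumption 3.1 amounts to requiring a positive minimum eigenvalue of the sample Fisher information restricted to the model subspace. A small sketch (assuming NumPy and that an orthonormal basis for M is available as the columns of a matrix) computes this constant:

import numpy as np

def restricted_strong_convexity_constant(Q, basis_M):
    # Smallest eigenvalue of the sample Fisher information Q restricted to the
    # model subspace M spanned by the orthonormal columns of basis_M.
    # For a quadratic loss this is the strong convexity constant m in Assumption 3.1.
    QM = basis_M.T @ Q @ basis_M
    return np.linalg.eigvalsh(QM).min()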
We also require the sample Fisher information matrix to be locally Lipschitz continuous over the model subspace, i.e.

‖Q^(n)(θ_1) − Q^(n)(θ_2)‖_2 ≤ L‖θ_1 − θ_2‖_2

for some L > 0 and all θ_1, θ_2 ∈ B_r(θ⋆) ∩ M. This condition automatically holds for all twice-continuously differentiable ℓ^(n), hence we do not state this condition as an assumption.
To obtain model selection consistency results, we must first generalize the irrepresentable condition for the lasso penalty to a geometrically decomposable penalty. We use P_C to denote the orthogonal projector onto span(C), and γ_C to denote the gauge function of a convex set C containing the origin: γ_C(x) := inf{t > 0 | x ∈ tC}.

Assumption 3.2 (Irrepresentability). There exists τ ∈ (0, 1) such that

sup { V(P_{M⊥}(Q^(n)(P_M Q^(n) P_M)^† P_M − I) z) | z ∈ ∂h_A(B_r(θ⋆) ∩ M) } < 1 − τ,

where V is the infimal convolution of γ_I and the characteristic function of S⊥:

V(z) = inf { γ_I(u_I) | u_I + u_{S⊥} = z, u_{S⊥} ∈ S⊥ }.

If u_I(z) and u_{S⊥}(z) achieve the infimum in V(z), then V(z) = γ_I(u_I(z)). Thus, if V(z) < 1, then u_I(z) ∈ relint(I), i.e. the irrepresentable condition says we can decompose any such z into u_I + u_{S⊥}, where u_I ∈ relint(I) and u_{S⊥} ∈ S⊥.
The function V is a semi-norm. Proof. First, we show V is positively homogeneous. For any α > 0, V(αz) = inf{γ_I(u) | u + w = αz, w ∈ S⊥}. Let u = αv and w = αw′; then V(αz) = inf{γ_I(αv) | v + w′ = z, w′ ∈ S⊥} = α inf{γ_I(v) | v + w′ = z, w′ ∈ S⊥} = αV(z), and V(0) = 0. V also satisfies the triangle inequality: if z_1 = u_1 + w_1 and z_2 = u_2 + w_2 with w_1, w_2 ∈ S⊥, then z_1 + z_2 = (u_1 + u_2) + (w_1 + w_2) with w_1 + w_2 ∈ S⊥, and γ_I(u_1 + u_2) ≤ γ_I(u_1) + γ_I(u_2) because γ_I is a gauge. Taking infima over the decompositions gives V(z_1 + z_2) ≤ V(z_1) + V(z_2). Thus V satisfies the triangle inequality and is a semi-norm.
Intuitively, the irrepresentable condition requires the active predictors to be not overly dependent on the inactive predictors. The irrepresentable condition is a standard assumption for model selection consistency and has been shown to be almost necessary for the sign consistency of the lasso [35,31]. We generalize their analysis to geometrically decomposable penalties in Section 6.
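When specialized to the lasso with a squared-error loss (so S = R^p), Assumption 3.2 reduces to the familiar lasso irrepresentable condition ‖Q^(n)_{IA}(Q^(n)_{AA})^{-1}‖_∞ ≤ 1 − τ, with the norm being the maximum row-wise ℓ1 norm. The following sketch (assuming NumPy; it checks the uniform version of the condition, which does not use the signs of θ⋆) computes the margin:

import numpy as np

def lasso_irrepresentability_margin(X, active):
    # Margin 1 - max_{j in I} ||Q_{jA} Q_{AA}^{-1}||_1 with Q = X^T X / n.
    # A positive margin corresponds to a valid tau in Assumption 3.2 for the lasso.
    n, p = X.shape
    A = np.asarray(active)
    I = np.setdiff1d(np.arange(p), A)
    Q = X.T @ X / n
    M = Q[np.ix_(I, A)] @ np.linalg.inv(Q[np.ix_(A, A)])
    return 1.0 - np.abs(M).sum(axis=1).max()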
We also require there to be a finite τ̄ bounding the analogous supremum of V over the relevant compact set; the quantity inside the supremum is a continuous function of x and attains its supremum over compact sets, so τ̄ surely exists, and we do not state this requirement as an assumption. Finally, we state our main theorem and describe how to use this result. We use κ(p) to denote the compatibility constant between a semi-norm p and the ℓ2 norm over the model subspace (3.1):

κ(p) := sup { p(x) | x ∈ M, ‖x‖_2 ≤ 1 }.

This constant quantifies how large p(x) can be compared to ‖x‖_2 for x ∈ M.
Theorem 3.4. Suppose Assumptions 3.1 and 3.2 hold and λ is chosen large enough relative to ℓ*_p(∇ℓ^(n)(θ⋆)) but small enough relative to m, r, τ, τ̄, and the compatibility constants, where ℓ_p and ℓ*_p are dual norms. Then the penalized M-estimator is unique, consistent (in the ℓ2 norm), and model selection consistent, i.e. the optimal solution to (1.1) lies in the model subspace M and its ℓ2 estimation error is bounded by a constant multiple of λ. In Section 5, we use this theorem to derive consistency and model selection consistency results for the generalized lasso and penalized likelihood estimation for exponential families.

Proof of the main result
We prove Theorem 3.4 by constructing a primal-dual pair for the original problem with the desired properties: consistency and model selection consistency. The proof consists of these steps: 1. Solve a restricted problem (4.1) that enforces the constraint θ ∈ M to obtain a restricted primal-dual pair, and show this restricted primal solution θ̂ is consistent (cf. Proposition 4.1).
2. Establish a dual certificate condition that guarantees all solutions to the original problem are also solutions to the restricted problem (cf. Proposition 4.2).
3. Construct a primal-dual pair for the original problem from the restricted primal-dual pair, and show it satisfies the dual certificate condition. This means the solution to the restricted problem is also the solution to the original problem.
This strategy is called the dual certificate or primal-dual witness technique [31]. First, we solve the restricted problem to obtain a restricted primal-dual pair θ̂, v_A, v_{M⊥}. This restricted primal-dual pair satisfies the first-order optimality conditions of the restricted problem. We enforce the subspace constraint θ̂ ∈ M, hence θ̂ is model selection consistent. We also show that θ̂ is consistent.
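To illustrate the technique in its simplest setting, the sketch below (assuming NumPy) carries out a primal-dual witness construction for the lasso: it solves the restricted problem on a given support by coordinate descent and then checks strict dual feasibility on the complementary coordinates. It is a toy illustration of the three steps above, not the construction used in the proof for general geometrically decomposable penalties.

import numpy as np

def lasso_primal_dual_witness(X, y, lam, active, n_sweeps=200):
    # Primal-dual witness for the lasso with loss (1/2n)||y - X theta||^2.
    # Step 1: solve the restricted problem with support fixed to `active`.
    # Steps 2-3: construct the dual variable on the inactive set and check
    # strict dual feasibility, which certifies that the restricted solution
    # also solves the full lasso problem.
    n, p = X.shape
    A = np.asarray(active)
    I = np.setdiff1d(np.arange(p), A)
    XA = X[:, A]
    theta_A = np.zeros(len(A))
    for _ in range(n_sweeps):                 # coordinate descent on the active block
        for j in range(len(A)):
            r = y - XA @ theta_A + XA[:, j] * theta_A[j]
            z = XA[:, j] @ r / n
            theta_A[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (XA[:, j] @ XA[:, j] / n)
    u_I = X[:, I].T @ (y - XA @ theta_A) / (n * lam)   # candidate dual variable on I
    certified = np.abs(u_I).max() < 1.0 if len(I) else True
    theta = np.zeros(p)
    theta[A] = theta_A
    return theta, certified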
Proposition 4.1. Suppose the assumptions above hold and λ is chosen appropriately. Then the solution to the restricted problem (4.1) is unique and satisfies the error bound in Theorem 3.4.

Proof. θ̂ solves the restricted problem, hence its objective value is at most the objective value at θ⋆. We rearrange to obtain a bound on the excess loss. θ̂ and θ⋆ both lie in M, and ℓ^(n) is locally strongly convex over B_r(θ⋆) ∩ M (and convex in general), hence the strong convexity inequality yields a quadratic lower bound in ‖θ̂ − θ⋆‖_2. We assume ‖θ̂ − θ⋆‖_2 ≤ r and verify this assumption later. We take norms to obtain (4.5), where ℓ*_p is the dual norm to ℓ_p. It is more convenient to bound the estimation error in the ℓ2 norm, hence we pass to the compatibility constants. We substitute this bound into (4.5) to obtain the stated error bound.

Remark 4.1. In some cases, this bound on the estimation error can be tightened. E.g., in some special instances of the generalized lasso, we can handle the first term in (4.5) more delicately to obtain a tighter bound. This allows us to use a smaller λ and reduces the sample complexity of the procedure.
Then, we establish a dual certificate condition that guarantees all solutions to the original problem satisfy h_I(θ) = 0. Thus all solutions to the original problem are also solutions to the restricted problem.
The original problem (1.1) is convex, hence the optimal value is unique. Let θ_1 and θ_2 be two solutions with associated subgradients u_{A,1} + u_{I,1} + u_{S⊥,1} and u_{A,2} + u_{I,2} + u_{S⊥,2}, respectively. We subtract λ(u_{A,1} + u_{I,1} + u_{S⊥,1})^T θ_2 from both sides, rearrange, and substitute in (4.6). Both θ_1 and θ_2 are in S, hence we can ignore the terms u_{S⊥,2}^T θ_2 and u_{S⊥,1}^T θ_2 to obtain (u_{A,2} + u_{I,2})^T θ_2 ≤ (u_{A,1} + u_{I,1})^T θ_2.
But we also know a second inequality of the same type. We combine these two inequalities and simplify to obtain u_{I,2}^T θ_2 ≤ u_{I,1}^T θ_2. If u_{I,1} ∈ relint(I), then u_{I,1}^T θ_2 = u_{I,2}^T θ_2 if θ_2 has no component in span(I), and u_{I,1}^T θ_2 < u_{I,2}^T θ_2 if θ_2 has a component in span(I).
But we also know u_{I,2}^T θ_2 ≤ u_{I,1}^T θ_2. Thus we deduce θ_2 has no component in span(I) and h_I(θ_2) = 0.
Finally, we use the restricted primal-dual pair θ̂, v_A, v_{M⊥} to construct a primal-dual pair for the original problem (1.1). The optimality conditions of the original problem are

∇ℓ^(n)(θ̂) + λ(û_A + û_I + û_{S⊥}) = 0,   (4.8)

where û_A ∈ ∂h_A(θ̂), û_I ∈ ∂h_I(θ̂), and û_{S⊥} ∈ S⊥. We set û_I = arg min_u γ_I(u) over the decompositions appearing in the definition of V. Hence θ̂ is also a solution to the original problem.
We seek to show the solution to the original problem is unique by verifying that the conditions of Proposition 4.2 are satisfied. To do this, we must verify that û_I satisfies the dual certificate condition, i.e. û_I ∈ relint(I).
A primal-dual solution θ̂, v_A, v_{M⊥} for the restricted problem (4.1) satisfies (4.3) and thus the zero reduced gradient condition over M. We Taylor expand ∇ℓ^(n) around θ⋆, with remainder equal to the Taylor remainder term, and rearrange. P_M Q^(n) P_M is invertible over M, hence we can solve for θ̂ − θ⋆ to obtain (4.10). We can also Taylor expand (4.3) around θ⋆ and substitute (4.10) into this expression, where we use the fact that the row space of P_M Q^(n) P_M is M. We need to show V(P_{M⊥} v_{M⊥}) < 1. Using the facts that (i) V is a semi-norm and (ii) v_{M⊥} ∈ M⊥, we obtain a bound on V(P_{M⊥} v_{M⊥}) consisting of two terms. We use the irrepresentable condition to bound the first term by 1 − τ, and we deduce the second term is bounded by τ̄ times the size of the gradient and remainder terms divided by λ. We select λ such that λ > (2τ̄/τ) ℓ*_p(∇ℓ^(n)(θ⋆)), hence the gradient contribution is at most τ/2. To show û_I ∈ relint(I), it remains to control the Taylor remainder term.

Lemma 4.3. Suppose ℓ^(n) is twice continuously differentiable. If the assumptions of Proposition 4.1 hold and we select λ large enough relative to the Lipschitz constant L and the estimation error bound, then the remainder term contributes at most τ/2.

Proof. The Taylor remainder term can be expressed in integral form. According to Taylor's theorem, there is a point θ̃ on the line segment between θ̂ and θ⋆ such that the remainder equals the corresponding second-order expression evaluated at θ̃. We add these two expressions. ∇ℓ^(n) is continuously differentiable, hence there exists L such that ‖Q^(n)(θ) − Q^(n)(θ⋆)‖_2 ≤ L‖θ − θ⋆‖_2 for all θ ∈ M in a ball of radius r at θ⋆. The assumptions of Proposition 4.1 hold, hence ‖θ̂ − θ⋆‖_2 ≤ r and the remainder is controlled by L‖θ̂ − θ⋆‖_2². If we select λ as stated, then the remainder contributes at most τ/2. We substitute this bound into (4.13) to obtain V(P_{M⊥} v_{M⊥}) < 1. This means û_I ∈ relint(I), and by Proposition 4.2, all solutions to the original problem (1.1) satisfy h_I(θ) = 0. Thus θ̂ is also the unique solution to the original problem.

Examples
We use Theorem 3.4 to establish the consistency and model selection consistency of the generalized lasso and a group lasso penalized likelihood estimator in the high-dimensional setting. Our results are nonasymptotic, i.e. we obtain bounds in terms of sample size n and problem dimension p that hold with high probability.

The generalized lasso
Consider the linear model y = Xθ⋆ + ε, where X ∈ R^{n×p} is the design matrix and θ⋆ ∈ R^p are unknown regression parameters. We assume the columns of X are normalized so that ‖x_i‖_2 ≤ √n. The noise ε ∈ R^n has i.i.d., zero-mean, sub-Gaussian entries with parameter σ².
We seek an estimate of θ⋆ with the generalized lasso

minimize_{θ ∈ R^p}  (1/(2n))‖y − Xθ‖_2² + λ‖Dθ‖_1,   (5.1)

where D ∈ R^{m×p}. The generalized lasso penalty is geometrically decomposable: ‖Dθ‖_1 = h_{D^T B∞,A}(θ) + h_{D^T B∞,I}(θ), where h_{D^T B∞,A} and h_{D^T B∞,I} are support functions of the sets D^T B∞,A and D^T B∞,I defined in Section 2. The sample Fisher information matrix is Q^(n) = (1/n) X^T X. Q^(n) does not depend on θ, hence the Lipschitz constant of Q^(n) is zero. The restricted strong convexity constant is the minimum of θ^T Q^(n) θ over unit-norm θ in the model subspace. The columns of X are normalized, so each x_i^T ε is sub-Gaussian and satisfies a Hoeffding-type inequality (cf. Proposition 5.10 in [30]). By the union bound over i = 1, . . . , p, we obtain a tail bound on ‖∇ℓ^(n)(θ⋆)‖_∞ = (1/n)‖X^T ε‖_∞. If we select λ > 2√2 σ (τ̄/τ) √((log p)/n), then there exists c such that the condition on λ in Theorem 3.4 holds with probability at least 1 − 2 exp(−cλ²n). Thus the assumptions of Theorem 3.4 are satisfied with probability at least 1 − 2 exp(−cλ²n), and we deduce the generalized lasso is consistent and model selection consistent.
Corollary 5.1. Suppose y = Xθ⋆ + ε, where X ∈ R^{n×p} is the design matrix, θ⋆ are unknown regression parameters, and ε has i.i.d., zero-mean, sub-Gaussian entries with parameter σ². If we select λ > 2√2 σ (τ̄/τ) √((log p)/n), then, with probability at least 1 − 2 exp(−cλ²n), the generalized lasso estimate is unique, consistent, and model selection consistent, i.e. the optimal solution to (5.1) lies in the model subspace M and its ℓ2 estimation error is bounded by a constant multiple of λ.
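In practice, the threshold for λ in Corollary 5.1 is easy to compute once σ and the constants τ and τ̄ are known (or conservatively bounded); a small helper, treating those constants as inputs, might look like:

import numpy as np

def generalized_lasso_lambda(n, p, sigma, tau, tau_bar):
    # Lower bound on the regularization level in Corollary 5.1:
    # lambda > 2 * sqrt(2) * sigma * (tau_bar / tau) * sqrt(log(p) / n).
    return 2.0 * np.sqrt(2.0) * sigma * (tau_bar / tau) * np.sqrt(np.log(p) / n)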

Learning exponential families with redundant representations
Suppose X is a random vector, and let φ be a vector of sufficient statistics.
The exponential family associated with these sufficient statistics is the set of distributions of the form p_θ(x) = exp(θ^T φ(x) − A(θ)), where θ are the natural parameters and A is the log-partition function

A(θ) = log ∫ exp(θ^T φ(x)) dμ(x),

where μ is some reference measure. Assuming this integral is finite, A ensures the distribution is normalized. The set of θ such that A(θ) is finite is called the domain of this exponential family. If the domain is open, then this is a regular exponential family. In this case, A is an analytic function, so its derivatives exist and cannot grow too quickly. Thus the gradient and Hessian of A are locally Lipschitz continuous, i.e. Lipschitz continuous in a ball of radius r around θ⋆ with constants L_1 and L_2, respectively. ∇A(θ) and ∇²A(θ) are the mean and (centered) second moment of the sufficient statistics: ∇A(θ) = E_θ[φ(X)] and ∇²A(θ) = E_θ[φ(X)φ(X)^T] − E_θ[φ(X)] E_θ[φ(X)]^T, where E_θ is the expectation with respect to the distribution with parameters θ. L_1 and L_2 can be expressed in terms of the operator norms of ∇²A and ∇³A over B_r(θ⋆). Suppose we are given samples x^(1), . . . , x^(n) drawn i.i.d. from an exponential family with unknown parameters θ⋆ ∈ R^p. We seek a group lasso penalized maximum likelihood estimate (MLE) of the unknown parameters:

minimize_{θ ∈ S}  A(θ) − (1/n) Σ_{i=1}^n θ^T φ(x^(i)) + λ‖θ‖_{2,1},   (5.2)

where ‖θ‖_{2,1} = Σ_{g∈G} ‖θ_g‖_2 is the group lasso penalty. If the exponential family has a redundant representation, then each distribution in this family is associated with an affine subspace of the parameter space. The constraint θ ∈ S makes the solution unique even when the exponential family has a redundant representation. Many undirected graphical models can be naturally viewed as exponential families, so estimating the parameters of exponential families is equivalent to learning undirected graphical models, a problem of interest in many statistical, computational, and mathematical fields. We refer to Section 2.4 in [32] for some examples of graphical models.
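To make the loss in (5.2) concrete, the sketch below (assuming NumPy) evaluates the sample negative log-likelihood A(θ) − (1/n) Σ_i θ^T φ(x^(i)) and its gradient ∇A(θ) − (1/n) Σ_i φ(x^(i)) for a toy exponential family, namely independent Bernoulli coordinates with φ(x) = x; the family and all function names are illustrative choices, not the models studied in our experiments.

import numpy as np

def neg_log_likelihood(theta, phi_bar, A):
    # Sample negative log-likelihood: A(theta) - <theta, average sufficient statistic>.
    return A(theta) - theta @ phi_bar

def grad_neg_log_likelihood(theta, phi_bar, grad_A):
    # Gradient: grad A(theta) - phi_bar, where grad A(theta) = E_theta[phi(X)].
    return grad_A(theta) - phi_bar

# Toy family: independent Bernoulli coordinates, phi(x) = x,
# A(theta) = sum_j log(1 + exp(theta_j)), grad A(theta) = sigmoid(theta).
A = lambda theta: np.sum(np.logaddexp(0.0, theta))
grad_A = lambda theta: 1.0 / (1.0 + np.exp(-theta))

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.0])
x = (rng.random((1000, 3)) < grad_A(theta_star)).astype(float)  # i.i.d. samples
phi_bar = x.mean(axis=0)
print(neg_log_likelihood(theta_star, phi_bar, A))
print(grad_neg_log_likelihood(theta_star, phi_bar, grad_A))     # approximately zero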
We can decompose the group lasso penalty group-wise to obtain ‖θ‖_{2,1} = h_{B(2,∞),A}(θ) + h_{B(2,∞),I}(θ), where h_{B(2,∞),A} and h_{B(2,∞),I} are support functions of the sets

B(2,∞),A = {θ ∈ R^p | max_{g∈G} ‖θ_g‖_2 ≤ 1 and θ_g = 0, g ∈ I},
B(2,∞),I = {θ ∈ R^p | max_{g∈G} ‖θ_g‖_2 ≤ 1 and θ_g = 0, g ∈ A}.

We enforce the subspace constraint using the support function of S⊥. Thus we can express (5.2) in the geometrically decomposable form (2.1). The sample Fisher information matrix is Q^(n)(θ) = ∇²A(θ); it does not depend on the sample, hence if the population Fisher information matrix Q = ∇²A satisfies Assumptions 3.1 and 3.2, then Q^(n) also satisfies these assumptions. If the model is identifiable over the feasible subspace S, then Q satisfies Assumption 3.1 because the log-partition function A is strictly convex over S, hence strongly convex on a compact subset of S.
We select λ according to (5.5) below. First we show that if ∇A is Lipschitz continuous in B_r(θ⋆), then the components of ∇ℓ^(n)(θ⋆) are sub-exponential random variables, so they satisfy a Bernstein-type inequality (cf. Proposition 5.16 in [30]).

Lemma 5.2. Suppose X is distributed according to a distribution in the exponential family and ∇A is Lipschitz continuous with constant L in a ball of radius r around θ⋆. Then for |t| ≤ r, the moment generating function of each component of ∇ℓ^(n)(θ⋆) satisfies a sub-exponential bound.

Proof. ∇ℓ^(n)(θ⋆) can be expressed as ∇A(θ⋆) − (1/n) Σ_i φ(x^(i)), hence we can express the m.g.f. of its components in terms of A.
∇A is Lipschitz continuous in a ball of radius r around θ⋆, so if |t| ≤ r, then the relevant difference of log-partition values is bounded by Lt²/2. We substitute this bound into (5.4) to obtain the desired bound on the m.g.f. By the Bernstein-type inequality (5.3) and Lemma 5.2, we deduce a tail bound on each group of ∇ℓ^(n)(θ⋆), and we take a union bound over the groups to obtain a tail bound on max_{g∈G} ‖∇ℓ^(n)(θ⋆)_g‖_2. If we select λ as in (5.5), then there exists c such that the condition on λ in Theorem 3.4 holds with probability at least 1 − 2 max_{g∈G} |g| exp(−cλ²n). We also require ‖θ̂ − θ⋆‖_2 ≤ r, hence the sample size n must be larger than the bound in (5.6). The model subspace M is the set {θ | θ_g = 0, g ∈ I; θ ∈ S}, and the compatibility constants κ(ℓ_{2,∞}), κ(ℓ_{2,1}), and κ(h_A) can be computed for this subspace. We substitute these expressions into (5.6) to deduce how large n must be.

Corollary 5.3. If we select λ according to (5.5), then, with probability at least 1 − 2 max_{g∈G} |g| exp(−cλ²n), the penalized maximum likelihood estimator is unique, consistent, and model selection consistent, i.e. the optimal solution to (5.2) lies in the model subspace M and its ℓ2 estimation error is bounded by a constant (depending on |A| and m) multiple of λ.
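For completeness, here is a minimal proximal gradient sketch (assuming NumPy) for the group lasso penalized MLE in the simple case S = R^p (no redundancy, so the subspace constraint is vacuous); grad_A is a callable for ∇A, and the step size and iteration count are illustrative.

import numpy as np

def prox_group_lasso(theta, groups, t):
    # Proximal operator of t * sum_g ||theta_g||_2 (block soft-thresholding).
    out = theta.copy()
    for g in groups:
        nrm = np.linalg.norm(theta[g])
        out[g] = max(1.0 - t / nrm, 0.0) * theta[g] if nrm > 0 else 0.0
    return out

def group_lasso_mle(grad_A, phi_bar, groups, lam, step, n_iter=500):
    # Proximal gradient for (5.2) with S = R^p: the loss gradient is
    # grad A(theta) - phi_bar, where phi_bar is the average sufficient statistic.
    theta = np.zeros_like(phi_bar)
    for _ in range(n_iter):
        theta = prox_group_lasso(theta - step * (grad_A(theta) - phi_bar), groups, step * lam)
    return theta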
Necessity of the irrepresentable condition

We now turn to a converse result: Lemma 6.1 gives a necessary condition for θ̂ ∈ B_r(θ⋆) ∩ M. To derive it, we Taylor expand ∇ℓ^(n) around θ⋆, with remainder equal to the Taylor remainder term. Since θ̂ ∈ M, this is equivalent to a condition over the model subspace, which we rearrange. We then Taylor expand (6.6) around θ⋆ and substitute (6.7) into this expression. The result is equivalent to a condition involving the sampling error ξ^(n). We project onto M⊥ and substitute (6.2) into this expression to obtain the desired result, the necessary condition (6.4).
Remark 6.1. Lemma 6.1 states a necessary condition for θ̂ ∈ B_r(θ⋆) ∩ M. To use this result to deduce the necessity of the irrepresentable condition, we must show that if the irrepresentable condition is violated, then there is δ > 0 such that Pr(P_{M⊥} ξ^(n) ∈ right side of (6.4)) ≤ 1 − δ.
Since (6.4) is necessary for θ̂ ∈ B_r(θ⋆) ∩ M, we must have Pr(θ̂ ∈ B_r(θ⋆) ∩ M) ≤ Pr(P_{M⊥} ξ^(n) ∈ right side of (6.4)) ≤ 1 − δ. For example, consider the linear model y = Xθ⋆ + ε, where X ∈ R^{n×p} is the design matrix, θ⋆ ∈ R^p are unknown regression parameters, and ε ∈ R^n is i.i.d., zero-mean Gaussian noise. We seek a generalized lasso estimate of θ⋆:

minimize_{θ}  (1/(2n))‖y − Xθ‖_2² + λ‖Dθ‖_1,

where D ∈ R^{m×p}. Let Q be the sample covariance. The sampling error ξ^(n) is a linear function of ε and hence a zero-mean Gaussian. Since P_{M⊥} ξ^(n) is a zero-mean Gaussian, we must have Pr(P_{M⊥} ξ^(n) ∈ C) ≤ 1/2 for any convex set C not containing a relative neighborhood of the origin. The generalized lasso penalty is geometrically decomposable, ‖Dθ‖_1 = h_{D^T B∞,A}(θ) + h_{D^T B∞,I}(θ), where D^T B∞,A and D^T B∞,I are the sets defined in Section 2. For r sufficiently small, ∂h_{D^T B∞,A}(B_r(θ⋆) ∩ M) = ∂h_{D^T B∞,A}(θ⋆).
If the irrepresentable condition is violated, then the right side of (6.10) is a convex set not containing a relative neighborhood of the origin. We deduce Pr(P_{M⊥} ξ^(n) ∈ right side of (6.10)) ≤ 1/2.

Computational experiments
We show some consequences of Corollary 5.3 with experiments on two models from structure learning of networks that are motivated by bioinformatics applications. We select λ proportional to √(((max_{g∈G} |g|) log |G|)/n) and use a proximal Newton-type method [17] to solve the penalized maximum likelihood problem.

Graphical lasso
Suppose we are given samples drawn i.i.d. from a normal distribution. We seek a penalized MLE of the inverse covariance matrix:

minimize_{Θ ≻ 0}  tr(ΣΘ) − log det Θ + λ Σ_{g∈G} ‖Θ_g‖_2,

where Σ denotes the sample covariance matrix and the groups g index blocks of entries of Θ. We use an ℓ1/ℓ2 penalty to promote block-sparse inverse covariance matrices, and λ is a parameter that trades off goodness-of-fit and sparsity. This estimator is a group variant of the graphical lasso [8].
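A direct, if not scalable, way to compute this estimator is to hand the penalized likelihood to a generic convex solver. The sketch below assumes the cvxpy package and represents each group as a 0/1 mask over entries of the inverse covariance matrix; our experiments instead use the proximal Newton-type method of [17].

import numpy as np
import cvxpy as cp

def group_graphical_lasso(S, group_masks, lam):
    # Group variant of the graphical lasso: penalized Gaussian MLE with an
    # l1/l2 penalty over blocks of the inverse covariance matrix.
    # S is the sample covariance; group_masks is a list of 0/1 matrices.
    p = S.shape[0]
    Theta = cp.Variable((p, p), symmetric=True)
    penalty = sum(cp.norm(cp.multiply(M, Theta), "fro") for M in group_masks)
    objective = cp.Minimize(cp.trace(S @ Theta) - cp.log_det(Theta) + lam * penalty)
    cp.Problem(objective, [Theta >> 0]).solve()
    return Theta.value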
We create a group sparse Gaussian MRF with a random group structure (see Figure 1). The nonzero entries of the inverse covariance matrix are drawn i.i.d. uniformly between 0 and 1. We draw samples and use the grouped graphical lasso to estimate the inverse covariance matrix. In these experiments, we varied the number of variables p from 64 to 225 and the sample size n from 100 to 1000.
We estimate the probability of correct model selection by the fraction of 100 trials in which the grouped graphical lasso correctly estimates the true group structure. Figure 2 plots the frequency of correct group structure selection versus the sample size n for four graphs with 64, 100, 144, and 225 nodes.
The fraction of correct model selection is small for small sample sizes but grows to one as the sample size increases. Naturally, more samples are required to learn a larger model, hence the curves for larger graphs are to the right of the curves for smaller graphs. If we plot these curves with the x-axis rescaled by 1/((max_{g∈G} |g|) log |G|), then the curves align. This is consistent with Corollary 5.3, which says the effective sample size scales logarithmically with |G|.
Mixed graphical model

We create a mixed model with 10 continuous variables and 10 binary variables (see Figure 3a, which shows the graph topology: blue nodes are continuous variables and red nodes are discrete variables). We estimate the probability of correct model selection by the fraction of 100 trials in which the estimator correctly estimates the true group structure. Figure 3 plots the fraction of correct group structure selection versus the sample size n.
The fraction of correct model selection is small for small sample sizes but grows with the sample size. The fraction of correct model selection with the penalized PLE grows to one, but the fraction with the penalized MLE stays around 0.9. This can be explained by the penalized MLE violating the irrepresentable condition. We refer to Section 3.1.1 in [25] for a similar example where the irrepresentable condition holds for a neighborhood-selection estimator but fails for the penalized MLE.

Conclusion
We proposed the notion of geometric decomposability and generalized the irrepresentable condition to geometrically decomposable penalties. This notion of decomposability builds on those of Negahban et al. [22] and Candès and Recht [7] and includes many common sparsity-inducing penalties. This notion of decomposability also allows us to enforce linear constraints.
We developed a general framework for establishing the model selection consistency of M-estimators with geometrically decomposable penalties. Our main result gives deterministic conditions on the problem that guarantee consistency and model selection consistency. We combine our main result with probabilistic analysis to establish the consistency and model selection consistency of the generalized lasso and group lasso penalized maximum likelihood estimators.