Estimator Augmentation with Applications in High-Dimensional Group Inference

To make inference about a group of parameters on high-dimensional data, we develop the method of estimator augmentation for the block Lasso, which is defined via the block norm. By augmenting a block Lasso estimator $\hat{\beta}$ with the subgradient $S$ of the block norm evaluated at $\hat{\beta}$, we derive a closed-form density for the joint distribution of $(\hat{\beta},S)$ under a high-dimensional setting. This allows us to draw from an estimated sampling distribution of $\hat{\beta}$, or more generally any function of $(\hat{\beta},S)$, by Monte Carlo algorithms. We demonstrate the application of estimator augmentation in group inference with the group Lasso and a de-biased group Lasso constructed as a function of $(\hat{\beta},S)$. Our numerical results show that importance sampling via estimator augmentation can be orders of magnitude more efficient than parametric bootstrap in estimating tail probabilities for significance tests. This work also brings new insights into the geometry of the sample space and the solution uniqueness of the block Lasso.


Introduction
There has been fast growth of high-dimensional data in many areas, such as genomics and the social sciences. Statistical inference for high-dimensional models has become a necessary tool for scientific discoveries from such data. For example, significance tests have been performed to screen millions of genomic loci for disease markers. These applications have motivated the recent development of high-dimensional statistical inference. Some methods make use of sample splitting and subsampling to quantify estimation errors and significance [10,11,25], while others rely on the bootstrap to approximate the sampling distributions of lasso-type estimators [3,29]. For Gaussian linear models, an interesting idea is to de-bias a regularized estimator so that it admits a tractable limiting distribution.

* Research supported by NSF grants DMS-1055286 and IIS-1546098.

Group inference
In this article, we consider the linear model

y = Xβ_0 + ε,  (1.1)

where β_0 ∈ R^p is the unknown parameter of interest, y ∈ R^n is a response vector, X = [X_1 | · · · | X_p] ∈ R^{n×p} is a design matrix, and ε ∈ R^n is an error vector with mean zero and variance σ². Define N_k = {1, . . . , k} for an integer k ≥ 1. We are interested in making inference about a group of the parameters, β_{0G} = (β_{0j})_{j∈G}, for G ⊂ N_p under a high-dimensional setting in which p > n. To be specific, the goal is to test the null hypothesis H_{0G}: β_{0G} = 0 and to construct confidence regions for β_{0G}. These are arguably the most general inference problems, including individual inference about β_{0j} as a special case when we choose G to be a singleton. Group inference arises naturally in applications where predictors have a block structure. For instance, inference about a group of genomic loci within the same gene for its association with a disease can identify responsive genes for the disease. Even if there is no application-driven block structure among the predictors, group inference may still be useful. By grouping variables, one can detect signals that are too small to detect individually. High correlation among predictors is a well-known difficulty for the lasso and related individual inference approaches. In this situation, grouping highly correlated predictors with the group lasso [26] greatly stabilizes the inference and increases detection power. Due to these advantages and their practical usage, a few methods for group inference have been developed in recent papers. A de-biased group lasso is proposed by Mitra and Zhang [12] as a generalization of the de-biased lasso for more efficient group inference. van de Geer and Stucky [23] define a de-sparsified estimator for β_{0G} with a surrogate Fisher information matrix constructed by a multivariate square-root lasso.
Meinshausen [9] develops the group-bound method to construct a one-sided confidence interval for ‖β_{0G}‖_1 and shows that it is possible to detect the joint contribution of a group of highly correlated predictors even when each predictor has no significant individual effect. Zhou and Min [30] establish that a modified parametric bootstrap is asymptotically valid for the group lasso and demonstrate the advantages of grouping in finite-sample inference.
A large portion of the above methods perform statistical inference based on the sampling distribution of an estimator b̂ = b̂(β̂, y, X) constructed as a function of β̂, which is either the lasso or the group lasso depending on whether group structure is used. Examples of such an estimator b̂ include the de-biased lasso, the de-biased group lasso, and the trivial case b̂ = β̂ in those methods that directly estimate the distribution of β̂. There are two big challenges in these approaches. First, the finite-sample distribution of b̂ is not well understood, due to the high dimension (p > n) and the sparsity of β̂. Consequently, the bootstrap has been used to approximate this distribution for inference. Although the de-biased estimators follow a nice asymptotic normal distribution as n → ∞, they can be far from normally distributed when n is finite. Indeed, recent papers have proposed to bootstrap the de-biased lasso as a better alternative [5,28]. This leads to the second challenge: how to efficiently simulate from the bootstrap distribution or an estimated sampling distribution of b̂? Without an explicit characterization of the finite-sample distribution of the group lasso (or the lasso), it appears that one can only rely on the bootstrap, which can be computationally inefficient, or even impractical, for some calculations, such as approximating tail probabilities in significance tests and conditional sampling given a selected model in post-selection inference.

Contributions of this work
To meet the aforementioned challenges in group inference, we develop the method of estimator augmentation for the block lasso. Partition the predictors into J disjoint groups G_j ⊂ N_p for j = 1, . . . , J. For β = (β_1, . . . , β_p), let β_{(j)} = (β_k)_{k∈G_j} for j ∈ N_J. Given α ∈ [1, ∞], the block lasso is defined by minimizing the penalized loss function

L(β; α) = (1/2n) ‖y − Xβ‖² + λ Σ_{j=1}^{J} w_j ‖β_{(j)}‖_α,  (1.2)

where ‖·‖ denotes the Euclidean norm. The weight w_j > 0 usually depends on the group size p_j = |G_j|. The regularizer is the block-(1, α) norm (when w_j = 1) of β, hence the name block lasso. Note that the lasso and the group lasso can be regarded as the special cases α = 1 and α = 2, respectively. In the context of group inference, we can always choose a partition so that G = G_j for some j, which translates our task into inference about β_{0(j)} using some function of β̂. Instead of the distribution of β̂, we work with the joint distribution of a so-called augmented estimator (β̂, S), where S = S(y, X) ∈ R^p is a vector. Under a particular choice of S, we are able to obtain a closed-form density for the exact distribution of the augmented estimator for any finite n and p and for all α ∈ [1, ∞]. Given the density, one may use Monte Carlo methods, such as importance sampling, to draw from the joint distribution and simultaneously obtain samples of β̂ and of any function of (β̂, S), such as the estimator b̂ used in an inferential method. This serves as a powerful and efficient alternative to parametric bootstrap for b̂, and it can be applied in any group inference approach that utilizes some function of the block lasso. Estimator augmentation is especially useful in determining the significance in a hypothesis test and in approximating [b̂ | β̂ ∈ B] for some event B. In both scenarios, we need to sample from a rare event, which is known to be difficult, and sometimes impossible, for the bootstrap.
We will demonstrate such applications with two group inference approaches, one using the group lasso and the other a de-biased group lasso.
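To make the block-lasso objective concrete, the group lasso special case (α = 2) can be minimized by proximal gradient descent, where the proximal operator of the penalty is block soft-thresholding. The following numpy sketch is our own illustrative solver, not an algorithm from this paper; the step size, iteration count, and default weights w_j = √p_j are common but arbitrary choices.

```python
import numpy as np

def block_lasso(X, y, groups, lam, weights=None, n_iter=1000):
    """Proximal gradient descent for the block lasso with alpha = 2 (group lasso):
    minimize (1/2n)||y - X beta||^2 + lam * sum_j w_j ||beta_(j)||_2."""
    n, p = X.shape
    if weights is None:
        weights = [np.sqrt(len(g)) for g in groups]  # common default: w_j = sqrt(p_j)
    beta = np.zeros(p)
    # step size 1/L, where L is the Lipschitz constant of the smooth part
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y) / n)  # gradient step on the quadratic loss
        for g, w in zip(groups, weights):             # prox step: block soft-thresholding
            nz = np.linalg.norm(z[g])
            beta[g] = 0.0 if nz <= step * lam * w else (1 - step * lam * w / nz) * z[g]
    return beta
```

With a strong signal in one group and a large λ, the solver returns a group-sparse estimate, matching the behavior of the penalty described above.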
Estimator augmentation [29] was first developed for the lasso, which does not respect any group structure. Generalizing the method to the block lasso for all block norms (α ∈ (1, ∞]) turns out to be very challenging technically. The sample space of the augmented estimator (β̂, S), which can be represented by a collection of manifolds with nonzero curvature, becomes more complicated. The joint distribution is thus defined over a curved space, a significant distinction from the augmented lasso estimator. Along the way, we also identify a set of sufficient conditions for solution uniqueness of the block lasso, which are weaker and more transparent than known results. In fact, the method of estimator augmentation applies to a large class of regularized estimators. To illustrate this view and promote its practical use, we further apply our method to find the joint density of an augmented scaled block lasso (β̂, S, σ̂), which includes a coherent estimator σ̂² of the error variance. When we were finalizing the first version of this paper, Tian Harris et al. posted a preprint [19] in which they generalize the technique of estimator augmentation to derive densities for selective sampling in a randomized convex learning program. This exemplifies that estimator augmentation may have much wider applications than those considered in our paper.
In addition to the above theoretical contributions, the significance of this work is also seen from its application in group inference, especially when the group size p_j is large. Mitra and Zhang [12] prove that de-biasing a scaled group lasso can achieve an efficiency gain in group inference by a factor of √p_j over a de-biased lasso. Zhou and Min [30] show that bootstrap inference with the group lasso can reach an optimal rate of n^{−1/2} if log J = O(p_j), which never holds for the lasso (p_j = 1) in the high-dimensional setting p ≫ n. These results demonstrate the benefit of group sparsity in making inference about a group of parameters. Our development of estimator augmentation for the block lasso enables efficient simulation from the sampling distributions of the group lasso and the de-biased group lasso, which is an essential component in practical applications of these inferential approaches.
Notation used in this paper is defined as follows. Let A ⊂ N_p be an index set. For a vector v = (v_j)_{1:p}, we define v_A = (v_j)_{j∈A}. For a matrix M = (M_{ij})_{n×p}, write its columns as M_j, j = 1, . . . , p, and define M_A = (M_j)_{j∈A} as the n × |A| matrix consisting of the columns indexed by A. Similarly, we define M_{A•} as the submatrix consisting of the rows of M indexed by A. We use row(M) and null(M) to denote the row space and the null space of M, respectively. Let diag(M, M′) be the block-diagonal matrix with M and M′ as the diagonal blocks. Denote by φ_n(•; c) the density of N_n(0, cI_n) for c > 0. Let S_α^{(m−1)} = {v ∈ R^m : ‖v‖_α = 1} denote the unit sphere of the α-norm in R^m. We may suppress (m − 1) and simply write S_α when the dimension does not need to be specified explicitly.
Throughout the paper, let α* be conjugate to α in the sense that 1/α + 1/α* = 1. We will assume that α ∈ (1, ∞) unless noted otherwise in the next three sections, and leave to Section 5.1 the special case α = ∞, whose technical details are slightly more complicated. Although not the focus of this paper, the results for the lasso can be obtained as another special case (α = 1) after some simple modifications of the results for α ∈ (1, ∞). The block norm reduces to the usual ℓ1 norm when α = 1, effectively ignoring the block structure, and thus in this case we always assume p_j = 1 for all j ∈ N_J without loss of generality.

The basic idea
In this section, we give an overview of the idea of estimator augmentation. We start with the Karush-Kuhn-Tucker (KKT) conditions for the minimization problem (1.2). Under uniqueness of the block lasso, we will establish a bijection between y and the augmented estimator and derive the joint density of its sampling distribution. We note that solution uniqueness for the block lasso is an interesting topic in its own right, and the sufficient conditions in this work are much more transparent than those in the existing literature.

The KKT conditions
Denote by sgn(·) the sign function with the convention that sgn(0) = 0. For a scalar function f, write f(v) = (f(v_1), . . . , f(v_m)) for its elementwise application to a vector v. Define

η(x) = sgn(x)|x|^ρ, where ρ = α*/α.  (2.1)

Denote its inverse function by η^{-1}(x) = sgn(x)|x|^{1/ρ}. Some basic properties of η are given in Lemma A.1 in Appendix A. In particular, η(v) for v ∈ S_{α*}, interpreted in the sense of (2.1), is a bijection from S_{α*} onto S_α. This fact is used in (2.2) below.
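As a quick numerical sanity check on this map, the following snippet verifies that η, taken elementwise with exponent ρ = α*/α (our reading of (2.1)), carries the unit α*-sphere onto the unit α-sphere, and that η^{-1}(x) = sgn(x)|x|^{1/ρ} inverts it:

```python
import numpy as np

alpha = 3.0
alpha_star = alpha / (alpha - 1.0)   # conjugate exponent: 1/alpha + 1/alpha* = 1
rho = alpha_star / alpha             # exponent in eta; an assumption consistent with (2.1)

def eta(v):
    # elementwise eta(x) = sgn(x)|x|^rho
    return np.sign(v) * np.abs(v) ** rho

def eta_inv(v):
    # inverse map eta^{-1}(x) = sgn(x)|x|^{1/rho}
    return np.sign(v) * np.abs(v) ** (1.0 / rho)

rng = np.random.default_rng(0)
s = rng.standard_normal(5)
s /= np.sum(np.abs(s) ** alpha_star) ** (1.0 / alpha_star)  # project onto the unit alpha*-sphere
v = eta(s)
```

The identity Σ|η(s)_k|^α = Σ|s_k|^{ρα} = Σ|s_k|^{α*} = 1 explains why the image lies on the unit α-sphere.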
Let W = diag(w_1 I_{p_1}, . . . , w_J I_{p_J}), which is a p × p matrix. By the first-order optimality of (1.2), there is a vector S = S(y, X) ∈ R^p such that

S_{(j)} = η^{-1}(β̂_{(j)}/‖β̂_{(j)}‖_α) if β̂_{(j)} ≠ 0, and ‖S_{(j)}‖_{α*} ≤ 1 otherwise.  (2.2)

The KKT conditions of (1.2), which are both sufficient and necessary, are

(1/n) X^T(y − Xβ̂) = λW S  (2.3)

for a vector S satisfying (2.2).
Definition 2. Let S be defined by (2.2) and (2.3). We will call (β̂, S) ∈ R^{2p} an augmented solution to the block lasso problem (1.2). When we study the sampling distribution of β̂, the random vector (β̂, S) will be called an augmented estimator.
If (β̂, S) is unique for each y, then (2.3) defines a bijective mapping from the space of (β̂, S) onto the space of y, which is the inverse of the minimization program (1.2) that maps y to (β̂, S). From the density of y or ε, one can hope to derive the joint density of the augmented estimator (β̂, S) via this bijective mapping. Then one may apply Monte Carlo methods, such as Markov chain Monte Carlo (MCMC) and importance sampling, to draw from the joint distribution of the augmented estimator. As a marginal distribution, the sampling distribution of β̂ can be readily approximated by Monte Carlo samples, as can that of any function of (β̂, S). This is the basic idea of estimator augmentation. Although the idea seems intuitive, there are a few technical difficulties in its implementation:
1. To establish the uniqueness of (β̂, S) under fairly general situations.
2. To characterize the sample space for (β̂, S), which appears to be a 2p-vector but in fact lives in the union of a finite number of n-dimensional manifolds. This makes the aforementioned bijection conceivable, since ε ∈ R^n.
3. To calculate the Jacobian of the mapping and obtain the target density via a change of variable.
We will establish the solution uniqueness in the remainder of this section, and take care of the other two major steps in Section 3. Although the basic idea follows from that in [29], there are substantial new technical issues in each of the three steps, which will be discussed in the sequel.
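The augmentation step itself is simple to see in the special case α = 1 (the lasso). The sketch below, using a plain coordinate-descent solver of our own (the settings are illustrative), fits a lasso and recovers the subgradient S from the stationarity condition S = X^T(y − Xβ̂)/(nλ); on the active set S should equal sgn(β̂_j), and elsewhere |S_j| ≤ 1:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # partial residual excluding j
            z = X[:, j] @ r / n
            # univariate soft-thresholding update for coordinate j
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(1)
n, p, lam = 30, 60, 0.2
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(n)
beta = lasso_cd(X, y, lam)
S = X.T @ (y - X @ beta) / (n * lam)   # subgradient recovered from the KKT conditions
```

The same recovery of S from the fitted residuals underlies the augmented estimator for general block norms.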

Uniqueness
We first establish that the subgradient S is unique, even when β̂ may not be (Lemma 2.1). This lemma covers the full domain of α, including the boundary cases α = 1 (lasso) and α = ∞. Hereafter, we call S the subgradient vector due to its uniqueness.

Next, we establish the uniqueness of β̂ and thus the uniqueness of the augmented solution (β̂, S). The lasso solution is unique if the columns of X are in general position [21], which says that the affine span of the set {s_j X_j : s_j ∈ {−1, 1}, j ∈ K ⊂ N_p}, for every |K| ≤ n ∧ p, does not contain any ±X_i for i ∉ K. We generalize this definition to establish the uniqueness of the block lasso. Let U = (1/n) X^T ε ∈ R^p, and denote the Gram matrix by Ψ = (1/n) X^T X hereafter. Plugging y = Xβ_0 + ε into (2.3), the KKT conditions can be written as

U − Ψ(β̂ − β_0) = λW S.  (2.4)

Since U ∈ row(X) and Ψ(β̂ − β_0) ∈ row(X), we have W S ∈ row(X), that is, S lies in the subspace

V = {s ∈ R^p : W s ∈ row(X)}.  (2.5)

The following assumptions are sufficient for the main results of this work. The two assumptions are quite weak. Assumption 1 simply states that X does not satisfy any additional linear constraint other than those that must be satisfied by any n × p matrix. If the entries of X are drawn from a continuous distribution, then Assumption 1 holds with probability one. To help understand the intuition behind Assumption 2, choose W = I_p to simplify the exposition. Then this assumption is imposed on the vectors X_{(j)} v_{(j)}, where v_{(j)} = η(s_{(j)}) ∈ S_α (Lemma A.1) and s ∈ V. Under Assumption 1 with n ≤ p, dim(V) = n, and v = η(s) ∈ R^p has only n free coordinates. Thus, we essentially require linear combinations of disjoint subsets of any n columns of X to be in general position, which is a mild condition in practice. For the special case of the lasso with p_j = 1, Assumption 2 reduces to the columns of X being in general position.

Theorem 2.2. Suppose that Assumption 2 holds. Then for any λ > 0 and y ∈ R^n, the solution β̂ to the block lasso problem (1.2) is unique.

Since solution uniqueness is a topic of independent interest, we make a brief comparison to some existing results.
Theorem 2.2 unifies a few important special cases, including the lasso (α = 1) and the group lasso (α = 2). For α = 1, this theorem is comparable to the result in [21], while the existing results about the uniqueness of the group lasso involve conditions that are much less transparent than the one stated here. As an example, Theorem 3 in [17] states that, under Assumption 1, the group lasso solution β̂ (with α = 2) is unique if (i) |G_A| ≤ n, where A = G(β̂) indexes the active groups and G_A collects their member coordinates, and (ii) A = {j ∈ N_J : ‖S_{(j)}‖_2 = 1}.
Unlike Assumption 2, which is imposed on X explicitly, conditions (i) and (ii) are implicit in nature and can be verified only after a particular solution has been calculated. According to Theorem 2.2, it is possible to have a unique solution when |G_A| > n as long as |A| ≤ n, i.e., there are no more than n active groups but the total number of active coefficients of β̂ could be greater than n. Such cases are not covered by the result in [17]. As will become clear in the next section, the set of β̂ satisfying (i) and (ii) is a proper subset of the full space of unique solutions and thus, in general, will have a probability mass strictly less than one.

Estimator augmentation
We will go through the main steps in detail to derive the joint density of the augmented estimator (β̂, S), which is useful for understanding this method. Section 3.1 characterizes the sample space of (β̂, S), Section 3.2 explicitly defines the bijective mapping from the KKT conditions, and Section 3.3 derives the joint density. A few concrete examples will follow in Section 3.4 to illustrate the method. The joint density of (β̂, S) depends on the true parameter β_0 and the error distribution. We will discuss in Section 4 how to apply estimator augmentation in high-dimensional inference. By default, we assume p ≥ n. The results for p < n will be obtained as special cases.

Sample space
Denote by γ̂ = (γ̂_j) ∈ R^J the vector of the block norms of β̂, i.e., γ̂_j = ‖β̂_{(j)}‖_α. It follows from (2.2) that β̂_{(j)} = γ̂_j η(S_{(j)}) for all j ∈ N_J. Thus, the augmented estimator (β̂, S) can be represented by the triplet (γ̂_A, S, A), where A = G(β̂) is a random subset of N_J when considering the sampling distribution. Given A = A for a fixed subset A ⊂ N_J, it is seen from (2.2) and (2.5) that the sample space for S is

M_A = {s ∈ V : ‖s_{(j)}‖_{α*} = 1 for j ∈ A and ‖s_{(j)}‖_{α*} < 1 for j ∉ A}.  (3.1)

Since ‖s_{(j)}‖_{α*} = 1 imposes one equality constraint for each j ∈ A and dim(V) = n under Assumption 1, M_A is an (n − |A|)-manifold in R^p if |A| ≤ n: it is the product of unit α*-spheres and balls intersected with the linear subspace V. Correspondingly, the space for (γ̂_A, S) given A is Ω_A = (R_+)^{|A|} × M_A, which is an n-manifold. Taking the union over subsets of size at most n, we obtain the sample space for the augmented estimator (γ̂_A, S, A):

Ω = ⋃_{A ∈ 𝒜} Ω_A × {A}, where 𝒜 = {A ⊂ N_J : |A| ≤ n}.  (3.2)

In summary, the sample space of the augmented estimator, represented by the triplet (γ̂_A, S, A), is the union of a finite number of n-manifolds. Thus, it is possible to find a bijective mapping from this space to R^n, the space for ε; see the mapping H̃ to be defined in (3.10). For the lasso, Ω_A degenerates to a union of n-dimensional polyhedra with zero curvature.

Remark 2.
Parameterizing the augmented estimator in terms of γ̂ and S is a critical choice for our derivations. In this way, all the equality constraints are imposed on S as in (3.1), leading to familiar geometry for the spaces of γ̂ and S, which is helpful for understanding distributions over these spaces. It is also a natural choice, since the subgradient S is always unique (Lemma 2.1) and non-uniqueness comes solely from γ̂ (Lemma A.4).

A bijective mapping
The KKT conditions (2.4) suggest mapping a point (r_A, s, A) ∈ Ω to

H(r_A, s, A) = Ψ(β − β_0) + λW s, where β_{(j)} = r_j η(s_{(j)}) for j ∈ A and β_{(j)} = 0 otherwise,  (3.4)

which defines a mapping H : Ω → row(X) for any β_0 ∈ R^p and λ > 0. For notational brevity, we often suppress its dependence on (β_0, λ) and write the mapping as H(•). In what follows, we show that H is bijective, which is a consequence of the uniqueness of (β̂, S), or equivalently of (γ̂_A, S, A).
The mapping H is established at a quite abstract level so far. It will be more convenient to work with the restriction of H to Ω_A for A ∈ 𝒜, defined by

H_A(r_A, s) = H(r_A, s, A) for (r_A, s) ∈ Ω_A.  (3.5)

Then H can be understood as a collection of one-to-one mappings {H_A : A ∈ 𝒜} indexed by subsets of N_J. Write the block lasso solution for the response y as β̂ = β̂(y), and consider the set of noise vectors v for which the active set of the block lasso solution β̂(Xβ_0 + v) is A. Denote the block norms and the subgradient of β̂ by γ̂ and S, respectively.

Q. Zhou and S. Min
The one-to-one mapping H_A allows us to obtain the density for (γ̂_A, S) from the density of the noise vector via a change of variable. It remains to find the differential of H_A so that we can calculate the Jacobian for the change of variable. A special aspect of this mapping is that H_A is defined on a manifold, and thus its differential is determined with respect to local parameterizations. As an (n − |A|)-manifold in R^p, a neighborhood of s ∈ M_A can be parameterized by s_F, where F ⊂ N_p may depend on (s, A) and |F| = n − |A|. Correspondingly, the n-manifold Ω_A will be parameterized by (r_A, s_F) ∈ R^n in a neighborhood of (r_A, s). Under this parameterization, Lemma 3.2, proven in Appendix B.1, gives an expression of dH_A in terms of a few matrices: the derivative η̇ of η, a diagonal matrix D = D(s, A) ∈ R^{p×p} whose diagonal elements are built from η̇, and the matrices T and M of tangent vectors described below. The lemma assumes that α ∈ (1, ∞) and Assumption 1 holds, and it applies at any interior point of Ω_A. The only reason that we exclude the case α = 1 in the above lemma is that η is not well-defined. We will cover this case, which reduces to the lasso, in Example 3.
Geometrically, the columns of T consist of a set of tangent vectors of the manifold M A while those of M consist of tangent vectors of the mapping H A . These tangent vectors determine the ratio between the volume element in the image space row(X) and that in the domain Ω A , and thus the Jacobian of the mapping.

Joint density
Now we make an explicit link from the augmented estimator (γ̂_A, S, A) to the noise vector ε. Under Assumption 1, null(X^T) = {0}, and thus by (3.4) we may define a mapping H̃ : Ω → R^n via

H(r_A, s, A) = (X^T/√n) H̃(r_A, s, A).  (3.10)

We note that H̃ ∈ R^n gives the coordinates of H with respect to the basis (X^T/√n) of row(X). By Lemma 3.1, H̃ is a bijection that maps Ω onto R^n. Define H̃_A similarly as for H_A in (3.5). It then follows from Lemma 3.2 that the Jacobian of H̃_A is

J_A(r_A, s) = |det ∂H̃_A/∂(r_A, s_F)|,  (3.11)

of which the matrix on the right side is of size n × n.
Theorem 3.3. Suppose that Assumptions 1 and 2 hold. Then the distribution of the augmented estimator (γ̂_A, S, A) for α ∈ (1, ∞) is given by the differential form in (3.12), with density f_A given by (3.13).

See Appendix B.2 for a proof, from which we see that (3.12) is valid as long as the block lasso program (1.2) has a unique solution for almost all y ∈ R^n. For each A ∈ 𝒜, the n-form dμ_A defines a measure on Ω_A in the following sense. Let k = n − |A| and let Φ parameterize a k-surface in M_A by u ∈ R^k; integrals over Ω_A are then computed as in (3.14), where the Jacobian ∂s_F/∂u = 1 if Φ is parameterized by s_F. Note that for a particular k-surface, parameterizations other than by s_F may be more convenient. As shown in (3.14), the distribution of (γ̂_A, S, A) is defined by a collection of measures, {μ_A : A ∈ 𝒜}, due to the discrete nature of A, and f_A is the density of μ_A parameterized by θ = (r_A, s_F). It is possible that f_A = ∞ on a set of measure zero in Ω_A. An important special case of the above integral is P(A = A) = μ_A(Ω_A). Lastly, summing over A in the above equation leads to Σ_{A∈𝒜} P(A = A) = 1.

Remark 5.
Evaluation of the joint density in (3.12) at any (r_A, s, A) ∈ Ω can be done by a simple procedure:
1. Find a local parameterization s_F for s and the associated matrix T.
2. Calculate the Jacobian J_A in (3.11), evaluate the mapping H̃_A(r_A, s) in (3.10), and plug them into (3.13) to obtain f_A(r_A, s).
See Examples 1 and 2 for concrete illustrations.
As a consequence of Theorem 3.3, the density f A under i.i.d. Gaussian errors can be found in the following corollary. See Appendix B.3 for a proof.
As we have seen, the sample space Ω (3.2) for the augmented estimator is complex due to the many constraints involved in M_A (3.1) and the mix of continuous and discrete components. It is quite surprising that one can find an exact joint density for the augmented estimator given β_0 and the noise distribution, which is usually simple under an i.i.d. assumption. The density gives a complete and explicit characterization of the sampling distribution according to (3.14). In light of the non-linear and sparse nature of β̂ and the high dimension of the problem, the joint density itself is a significant theoretical result that improves our understanding of the block lasso estimator. Applications of this result in group inference will be discussed in Section 4.

Remark 6.
We summarize the main differences between the joint density of the augmented block lasso in Theorem 3.3 and that of the augmented lasso in [29]. First, the sample space Ω A is an n-manifold with nonzero curvature for α > 1, and consequently the density is specified in (3.12) by a differential form of order n. In contrast, the space Ω A has no curvature for the augmented lasso estimator, whose density can be defined with respect to the Lebesgue measure. Second, the Jacobian (3.11) depends on r A , s and A for the block lasso, while it only depends on A for the lasso. See Example 3 for the technical reason and a geometric interpretation for these differences. Both aspects result in new challenging computational issues in this work for the development of Monte Carlo algorithms, which are discussed in Section 4.2.
For the sake of completeness, we also give the density of (γ̂_A, S, A) when p < n, which can be obtained by simple modifications of a few steps in the proof of the result for p ≥ n; see Appendix B.4 for details. Assume that rank(X) = p < n, which is sufficient for both Assumptions 1 and 2 to hold. Then row(X) and V (2.5) are identical to R^p, which implies that every s ∈ M_A (3.1) can be locally parameterized by p − |A| free coordinates. An interesting observation is that we need the density g_{n,X} of U = n^{−1} X^T ε when p < n, which is more difficult to determine than the density of ε needed in the high-dimensional case (3.12). The underlying reason for this can be found from the sufficient statistic t = X^T y in (2.3). When p < n, the dimension of t is smaller than the sample size n, and thus this statistic achieves the goal of data reduction. Consequently, one needs the distribution of t, or U, to determine the sampling distribution of β̂. However, when p ≥ n, y gives the coordinates of the statistic t using the rows of X as the basis, and thus the two are equivalent up to a change of basis, in which case the distribution of y, or ε, is all we need.

Examples
We illustrate the distribution of the augmented estimator with a few examples. Example 1 is a simple concrete example that demonstrates various key concepts, including the sample space, the density, and probability calculations. The second example shows that, under an orthonormal design, the joint distribution given by Theorem 3.3 coincides with the result from block soft-thresholding. The last example considers the lasso. Technical details involved in these examples are deferred to Appendix C.

Example 1. Consider a simple but nontrivial example with p = 3, n = 2, and J = 2. The two groups are G_1 = {1, 2} and G_2 = {3}, and we pick α = 2. Put r = (r_1, r_2) and s = (s_1, s_2, s_3). We first determine the space M_A (3.1). Incorporating the constraint s ∈ V, the manifold M_∅ can be expressed as a two-dimensional region D_∅, whose boundary ∂D_∅ consists of two arcs and two line segments; see Figure 1 for an illustration. Use ∂(q_1, q_2) to denote the boundary of D_∅ from q_1 to q_2 along the positive orientation. The two arcs in D_{{1}} can be parameterized by s_1, and Ω_{{1}} correspondingly by θ = (r_1, s_1) with two domains, R_+ × (−1, 0) and R_+ × (0, 1). After a few steps of algebra, we arrive at the density f_{{1}}(r_1, s_1). In Appendix C.1, we provide the results for A = ∅, {2}, {1, 2}, and verify that P(A = A) indeed sums to one.

Example 2 (Orthogonal design). Suppose p = n = mJ, Ψ = I_p, and put W = √m I_p. In this example, all the groups are of the same size m. It is known that under this setting, the group lasso (α = 2) is obtained by block soft-thresholding the least-squares estimator β̃ = (1/n) X^T y:

β̂_{(j)} = (1 − λ√m/‖β̃_{(j)}‖)_+ β̃_{(j)}, j ∈ N_J.  (3.21)

Assume ε ∼ N_n(0, σ² I_n). Then β̃_{(j)} ∼ N_m(β_{0(j)}, (σ²/n) I_m), j ∈ N_J, are mutually independent. The distribution of (γ̂_A, S, A) for α = 2, derived in Appendix C.2, is given by (3.22), in which the parameterization uses a chosen set of (m − 1) free coordinates of s_{(j)}, and |det M_{(jj)}| has a closed-form expression (C.10).
In what follows, we exemplify that (3.22) is consistent with block soft-thresholding (3.21).
Since dμ_A factorizes into a product of J terms, the groups (γ̂_j, S_{(j)}) are mutually independent, and the density f_j(r_j, s_{(j)}) determines the distribution of (γ̂_j, S_{(j)}). If j ∉ A, setting r_j = 0 and integrating f_j over s_{(j)}, we find that P(γ̂_j = 0) = P(‖β̃_{(j)}‖ ≤ λ√m), where the last equality comes from the distribution of β̃_{(j)}. This result is clearly consistent with the soft-thresholding rule (3.21). Next we calculate P(γ̂_j > t) for j ∈ A. To simplify our derivation, assume further that β_{0(j)} = 0. Integrating f_j dθ_{(j)} over the sphere s_{(j)} ∈ S^{m−1} yields the marginal density (3.24) of γ̂_j for r_j > 0. It then follows that, for t ≥ 0, P(γ̂_j > t) is given by (3.25), which again coincides with the result from soft-thresholding. See Appendix C.2 for the derivation of (3.24) and (3.25).
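The block soft-thresholding rule in the orthonormal setting above is easy to make explicit. The following sketch implements it with threshold λ√m (matching the weights W = √m I_p of Example 2; the group layout and numbers are our illustrative choices):

```python
import numpy as np

def block_soft_threshold(beta_ls, m, lam):
    """Block soft-thresholding of a least-squares estimate: groups of size m,
    threshold lam * sqrt(m), as in the orthonormal design of Example 2."""
    out = np.zeros_like(beta_ls)
    J = beta_ls.size // m
    for j in range(J):
        blk = beta_ls[j*m:(j+1)*m]
        norm = np.linalg.norm(blk)
        if norm > lam * np.sqrt(m):
            out[j*m:(j+1)*m] = (1 - lam * np.sqrt(m) / norm) * blk
    return out

b = np.array([3.0, 4.0, 0.1, 0.2])    # two groups of size m = 2
res = block_soft_threshold(b, 2, 0.5)
```

The first group (norm 5) is shrunk toward zero by the factor 1 − 0.5√2/5, while the second group (norm ≈ 0.22 < 0.5√2) is set exactly to zero, mirroring the event {γ̂_j = 0} discussed above.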
Example 3 (lasso). When α = 1 in (1.2), the block lasso reduces to the lasso with no group structure. Thus, the result for α = 1 can be deduced by letting p_j = 1 for all j and α = 2 (or any α > 1) in Theorem 3.3. In this case, for j ∈ A the subgradient S_j = sgn(β̂_j) ∈ {1, −1} is a function of β̂_j. This leads to two special properties of the matrix T = T(s, A) defined in Lemma 3.2, which do not hold in the general case p_j ≥ 2: (i) T = T(A) depends only on A; (ii) the submatrix T_{A•} is a zero matrix; see Appendix C.3. Bearing these facts in mind, one can apply Theorem 3.3 to find the joint distribution of the augmented lasso, given by the density in (3.26). Owing to property (i), the Jacobian and the set F here do not depend on s, which is fundamentally different from the block lasso. A geometric interpretation of (i) is that the space M_A (3.1) for the lasso is a union of polyhedra, and the set of tangent vectors that forms the columns of T is invariant at each s ∈ M_A, while in the block lasso case M_A is curved, with a different tangent space at different points. This is one of the aspects in which this work represents a highly nontrivial generalization of the result for the lasso. As adopted in [29], the augmented lasso estimator can also be represented by (β̂_A, S_B, A), where B = N_p \ A is the set of zero components of β̂. With the change of variable β̂_j = γ̂_j S_j for j ∈ A, one can easily obtain the density under this alternative parameterization from (3.26), which is identical to the joint density in Theorem 2 of [29] with the choice of (X^T/√n) as a basis for row(X). See Appendix C.3 for the technical details.

Applications in statistical inference
In this section, we develop Monte Carlo methods to make inference about β_0 by utilizing the joint density of the augmented block lasso estimator. Recall that we want to test the hypothesis H_{0,G}: β_{0G} = 0 or to construct confidence regions for β_{0G}. Without loss of generality, assume G = G_j for some j, so that our goal is to infer β_{0(j)}. Denote the null hypotheses by H_{0,j}: β_{0(j)} = 0 for j ∈ N_J.

Parametric bootstrap
Consider inference with an estimator of the form b̂ = b̂(β̂, S) ∈ R^p, a mapping of the augmented estimator (β̂, S). One such approach that has drawn recent attention is the de-biased lasso and its generalization, the de-biased group lasso. Given a p × p matrix Θ̂ = Θ̂(X), a form of the de-biased estimator may be expressed as

b̂ = β̂ + Θ̂ X^T(y − Xβ̂)/n = β̂ + λΘ̂ W S,  (4.1)

where (β̂, S) is either the augmented lasso or the augmented group lasso. Different de-biased estimators have been constructed with different Θ̂, which is often some version of a relaxed inverse of the Gram matrix Ψ. It is usually impossible to obtain the exact distribution of (b̂ − β_0) for a finite sample. Thus, bootstrap methods have been developed [5,28] with improved performance compared to asymptotic approximation for the de-biased methods.
Let b * = b(β * , S * ). Choosing a function h j : R pj → [0, ∞), we estimate its (1 − δ)-quantile h j,(1−δ) from a large bootstrap sample. Then a (1 − δ) confidence region for β 0(j) can be constructed as in (4.2), and by duality the p-value for testing H 0,j is approximated by the tail probability in (4.3). Common choices of h j include, for example, various norms and h j (θ) = ‖X (j) θ‖. Although beyond the scope of this paper, the asymptotic validity of (4.2) and (4.3) comes from the fact that (b (j) − β 0(j) ) is an asymptotic pivot for a careful choice of Θ [12,22]. A key observation is that the joint density of [β * , S * | β] is explicitly given by (3.12) in Theorem 3.3, with β in place of β 0 , through its equivalent representation. Denote this density (3.13) by f A (r A , s; β, σ 2 , λ) to emphasize its dependence on (β, σ 2 , λ). In principle, we can use Monte Carlo methods, such as importance sampling and MCMC, to draw (β * , S * ) and obtain a sample of b * = b(β * , S * ), which serves as an alternative to the above bootstrap sampling. Monte Carlo methods may bring computational efficiency and flexibility compared to the parametric bootstrap. In the following, we demonstrate the efficiency of importance sampling in calculating tail probabilities as in (4.3), a prominent difficulty for the bootstrap. Monte Carlo methods for other applications, including those with an estimated error distribution, are discussed in Section 4.5.
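The generic calibration step can be sketched as follows (an illustrative helper with our own naming: `h_star` stands for the Monte Carlo values of h j and `h_obs` for the observed statistic):

```python
import numpy as np

def quantile_and_tail(h_star, h_obs, delta=0.05):
    """Estimate the (1 - delta)-quantile of the Monte Carlo values (for a
    confidence region as in (4.2)) and the tail probability P(h >= h_obs)
    (the p-value approximation in (4.3))."""
    h_star = np.asarray(h_star, dtype=float)
    q = np.quantile(h_star, 1 - delta)
    pval = np.mean(h_star >= h_obs)
    return q, pval

# Toy usage with a stand-in bootstrap sample of |N(0,1)| values
rng = np.random.default_rng(1)
h_star = np.abs(rng.standard_normal(100_000))
q, p = quantile_and_tail(h_star, h_obs=2.5)
```

The plain empirical average in the last line is exactly where the bootstrap struggles when the true tail probability is tiny, which motivates the importance sampling below.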

Importance sampling
The following simple fact about the parameterization of M A (3.1), proved in Appendix B.5, is useful for designing proposal distributions in importance sampling.

Lemma 4.1. Let α ∈ (1, ∞). For each A ∈ A , the manifold M A , except for a set of measure zero, can be parameterized by s F such that the index set F = F (A) only depends on A.
A consequence of Lemma 4.1 is that we may use the same volume element dθ = dr A ds F almost everywhere in the subspace Ω A , which eases our development of a Monte Carlo algorithm. Suppose that q A (r A , s) is the density of a distribution over Ω with respect to dθ such that Σ A∈A ∫ Ω A q A (r A , s)dθ = 1. As long as the support of q A is Ω A for all A ∈ A , it can be used as a proposal distribution in importance sampling. With a slight abuse of notation, put θ = (r A , s) ∈ Ω A so that (θ, A) represents a point in the sample space Ω at which the volume element is dθ. Suppose we want to estimate the expectation of a function h(β, S) = h(γ A , S, A) with respect to f A , using (β, S) and (γ A , S, A) interchangeably. Importance sampling can be readily implemented given the densities f A and q A : draw (A t , θ t ) from the proposal q A (θ) for t = 1, . . . , N and calculate importance weights w t = f At (θ t )/q At (θ t ). Then, by the law of large numbers, the weighted sample mean provides the desired estimate. To estimate the probability in (4.3), h is taken to be the indicator function of the event of interest. When the true β 0(j) ≠ 0, the p-value (4.3) can be tiny, and the bootstrap (Algorithm 1) may fail to provide a meaningful estimate of the significance level. In such cases, it is much more efficient to use importance sampling with a proposal distribution that has a higher chance of reaching the tail of the bootstrap distribution f A (r A , s; β, σ 2 , λ).
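The recipe above — draw from q, weight by w t = f/q, and average — can be sketched on a generic one-dimensional toy problem. Everything below is illustrative, not the paper's target density: a standard normal target, an overdispersed normal proposal mimicking the variance inflation M > 1, and a small tail probability as the functional h:

```python
import numpy as np

def log_normal_pdf(x, scale=1.0):
    """Log density of N(0, scale^2), evaluated pointwise."""
    return -0.5 * (x / scale) ** 2 - np.log(scale) - 0.5 * np.log(2 * np.pi)

def is_estimate(h, log_f, log_q, draw, N, rng):
    """Importance sampling: draw theta_t ~ q, form w_t = f(theta_t)/q(theta_t),
    and return the weighted sample mean of h."""
    theta = draw(N, rng)
    w = np.exp(log_f(theta) - log_q(theta))
    return np.mean(w * h(theta))

# Estimate P(X > 4) for X ~ N(0, 1), a tail probability of about 3.2e-5,
# using the heavier-tailed proposal N(0, 3^2) to reach the tail often.
est = is_estimate(
    h=lambda x: (x > 4.0).astype(float),
    log_f=lambda x: log_normal_pdf(x),
    log_q=lambda x: log_normal_pdf(x, scale=3.0),
    draw=lambda N, rng: 3.0 * rng.standard_normal(N),
    N=100_000,
    rng=np.random.default_rng(2),
)
```

A direct Monte Carlo estimate with the same N would see only a handful of tail events; the overdispersed proposal places roughly 9% of its draws beyond the threshold, which is the same mechanism exploited by the proposals below.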
We design two types of proposal distributions. The first type draws (β * , S * ) by the bootstrap algorithm P B(β † , Mσ 2 , λ † ) with a proper choice of (β † , M, λ † ), where M > 0 is a constant. The proposal distribution has density f A (r A , s; β † , Mσ 2 , λ † ), again by Theorem 3.3. By inflating the error variance with M > 1, choosing β † = β, and possibly using a different λ † , we can propose samples in the region of interest in (4.3), which has a small probability under the target distribution. The Jacobian term J A (r A , s; λ) (3.11) is the time-consuming part in evaluating the densities for calculating importance weights. If we choose λ † = λ, however, this term cancels out and the importance weight is simply the ratio of two normal densities, whose calculation is almost costless. Our empirical study shows that this choice gives comparable estimation accuracy, and thus we always let λ † = λ. Denote by IS(β † , M) the importance sampling with the first type of proposal. Our second design uses a mixture of two proposal distributions with different β † and M , which has more flexibility in shifting samples to multiple regions of interest. Again, the Jacobian term cancels out in the importance weight (4.4). Our importance sampling with a mixture proposal is detailed in Algorithm 2. For brevity, write H̃ for the mapping appearing in the weight (4.4), which is identical to the H̃ in (3.10).
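The cancellation when λ † = λ means the log importance weight is just a difference of two Gaussian log densities of the implied error vectors. The following is a sketch under the Gaussian error model; `u_target` and `u_proposal` denote the values of ε/√n recovered from a sample under the target and proposal parameters, and the names are ours:

```python
import numpy as np

def log_normal_density(u, var):
    """Log density of N(0, var * I) at the vector u."""
    m = u.size
    return -0.5 * m * np.log(2 * np.pi * var) - 0.5 * np.sum(u ** 2) / var

def log_importance_weight(u_target, u_proposal, sigma2, M, n):
    """With lambda_dagger = lambda the Jacobian terms cancel, so the log weight
    reduces to log phi_n(u_target; sigma^2) - log phi_n(u_proposal; M sigma^2),
    where eps / sqrt(n) has per-coordinate variance sigma^2 / n."""
    return (log_normal_density(u_target, sigma2 / n)
            - log_normal_density(u_proposal, M * sigma2 / n))
```

Since no determinant needs to be evaluated, each weight costs only O(n) operations, which is what makes this choice essentially free relative to computing the group lasso solutions.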
Remark 7. The first algorithm IS(β † , M) can be regarded as a special case of Algorithm 2 with a 1 = 1, β † 1 = β † and M 1 = M . One can easily generalize Algorithm 2 to a mixture proposal with K ≥ 3 component distributions. For other error distributions, we simply replace φ n in (4.4) by g n , the density of ε/ √ n.
In our numerical results, the efficiency of an importance sampling estimate is measured by its coefficient of variation (cv) across multiple independent runs and compared with direct bootstrap outlined in Algorithm 1.
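For reference, the cv used here is simply the standard deviation of the repeated estimates divided by their mean; a minimal helper (our own naming):

```python
import numpy as np

def coefficient_of_variation(estimates):
    """cv of an estimator: sample standard deviation across independent
    replicate runs divided by the sample mean."""
    e = np.asarray(estimates, dtype=float)
    return e.std(ddof=1) / e.mean()
```

A smaller cv over independent runs means a more reproducible, and hence more efficient, estimate for the same Monte Carlo budget.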

Group lasso
We begin with a simpler application: testing the complete null hypothesis H 0 : β 0 = 0 using the statistic T = h(β) = Σ j ‖β (j) ‖, where β is the group lasso for a particular λ. In this case, our target density f A (r A , s; β 0 = 0, σ 2 , λ) determines the exact distribution of T under H 0 . We set the group size p j = 10 for all groups and fixed σ 2 = 1. Each row of X was drawn from N p (0, Σ), where the diagonal elements of Σ are all 1. The off-diagonal elements Σ ij = ρ 1 if i, j are in the same group and Σ ij = ρ 2 otherwise. We simulated 30 datasets with parameters (n, p, ρ 1 , ρ 2 ) reported in Table 1. Put v = (1, 1, 1, 1, −1, −1, −1, −1, 0, 0). For the first 10 datasets, we chose β 0 = 0 so that H 0 is true. For the other 20 datasets, the first two groups of β 0 were active, with β 0(1) = β 0(2) = v/2 for datasets 11 to 20 and β 0(1) = β 0(2) = v for datasets 21 to 30. For each dataset, λ was chosen to be the smallest value such that the group lasso solution had two active groups. The range of λ and that of the statistic T across the simulated datasets are reported in Table 1 as well. We applied the algorithm IS(0, 5) to generate N = 100,000 samples, denoted by β * t with importance weight w t for t = 1, . . . , N. The p-value for the observed statistic T was then estimated by the weighted tail average in (4.5). This procedure was repeated 20 times independently for each dataset to calculate the mean and the standard deviation of q̂ (IS) , from which we calculated cv(q̂ (IS) ). If we had used the bootstrap algorithm P B(0, σ 2 , λ) with the same N to estimate the p-value, denoted by q̂ (P B) , its cv would have been close to {(1 − q̂)/(N q̂)} 1/2 . Figure 2 plots log 10 (q̂), cv(q̂ (IS) ) and log 10 {cv(q̂ (P B) )/cv(q̂ (IS) )} for the 30 datasets. We observe from the ratios of cv's in panel (c) that, for datasets 11 to 30, the importance sampling estimates are much more accurate, while the estimated p-values, as shown in panel (a), are very small.
For many of these 20 datasets, the improvement of importance sampling over the bootstrap is five or more orders of magnitude. The p-values are insignificant for the first 10 datasets, in which the null hypothesis is true. In a majority of these cases, the importance sampling estimates are slightly less accurate than the bootstrap estimates, which is fully expected: the overdispersed proposal is designed to reach the tail rather than the bulk of the distribution, where these moderate p-values lie.
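The two ingredients of this example can be sketched as follows (a minimal illustration with our own naming; `block_norm_sum` computes the statistic T = Σ j ‖β (j) ‖ and `weighted_pvalue` a weighted tail average — whether one divides by N with exact weights or by Σ t w t in a self-normalized variant is a design choice, and (4.5) gives the form actually used):

```python
import numpy as np

def block_norm_sum(beta, groups):
    """Test statistic T = sum_j ||beta_(j)||_2 over a list of index arrays."""
    return sum(np.linalg.norm(beta[g]) for g in groups)

def weighted_pvalue(T_star, w, T_obs, self_normalize=False):
    """Importance sampling estimate of the tail probability P(T >= T_obs):
    the plain average sum_t w_t 1{T*_t >= T_obs} / N, or a self-normalized
    version dividing by sum_t w_t instead of N."""
    T_star = np.asarray(T_star, dtype=float)
    w = np.asarray(w, dtype=float)
    num = np.sum(w * (T_star >= T_obs))
    return num / (np.sum(w) if self_normalize else len(w))
```

With all weights equal to one, `weighted_pvalue` reduces to the plain bootstrap tail frequency, which makes the comparison between the two methods direct.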

A de-biased approach
The second application concerns a de-biased group lasso of the form (4.1). Since our method applies to any choice of Θ, to simplify the discussion we set Θ = Σ −1 , the inverse of the population covariance of X, instead of using a particular estimate. The test statistic in (4.3) is chosen as h j (b (j) ) = ‖X (j) b (j) ‖ =: T j .
We simulated 20 datasets independently under the same settings as those for datasets 11 to 30 in Table 1. The tuning parameter λ was chosen by the same method as in Section 4.3 to calculate the group lasso β and the de-biased estimate b (4.1) for each dataset. Figure 3 plots these two estimates for one dataset, in which β 0(1) = β 0(2) = v and β 0(j) = 0 for j > 2. We see that the de-biased group lasso b is not sparse, with b (j) ≠ 0 for all j, and its first two groups are closer to the active groups of β 0 than those of the group lasso: de-biasing largely removed the shrinkage in the active coefficients of the group lasso solution and substantially reduced its bias. Our goal here is to test H 0,1 : β 0(1) = 0 by estimating the tail probability in (4.3). This probability is not sensitive to the choice of β̃ as long as it is sparse; thus, we choose β̃ = β, the group lasso estimate. See [5] for related discussions.
We designed a mixture proposal for Algorithm 2, with two components (β † 1 , M 1 ) and (β † 2 , M 2 ), to approximate the p-value (4.3) by importance sampling. Note that β † 2(1) is the midpoint between β (1) and 0, serving as a bridge between the target distribution and the null hypothesis H 0,1 . To achieve a wider coverage of the sample space, the error variances of both component distributions were chosen to be greater than σ 2 . We applied Algorithm 2 to generate N = 100,000 weighted samples (β * t , S * t ), with weights w t , for each dataset. Similar to (4.5), the p-value for T 1 was estimated by a weighted tail average over b * t = b(β * t , S * t ) as in (4.1). We replicated this procedure 20 times independently to calculate the cv of q̂ (IS) as in the previous example. The same comparisons were conducted and the results are reported in Figure 4. A strong majority of the p-values were estimated to be significant, since β 0(1) ≠ 0 for all 20 datasets. The cv's of the importance sampling estimates are seen to be quite small, which is especially satisfactory for those tiny tail probabilities on the order of 10 −10 or smaller. As shown in Figure 4(c), our importance sampling estimation is more efficient than the parametric bootstrap for at least 13 out of the 20 datasets, many showing orders of magnitude of improvement. For most of the other datasets, the importance sampling results are comparable to those from the bootstrap. Compared to the parametric bootstrap in Algorithm 1, the only additional step in our importance sampling algorithms is to evaluate importance weights, such as (4.4), whose computing time is negligible relative to computing group lasso solutions. As a result, the total running time of the importance sampling is almost identical to that of the bootstrap sampling. The above two applications thus exemplify the huge gain in estimation accuracy from importance sampling via estimator augmentation at almost identical computing cost.
It is worth mentioning that accurate estimation of small p-values is crucial for ranking the importance of predictors and controlling false discoveries in large-scale screening.

Other applications
Given the joint density f A (r A , s; β, σ 2 , λ), one may design MCMC algorithms to draw samples (β * , S * ) from this distribution, which is identical to the distribution of a bootstrap sample generated by P B(β, σ 2 , λ) in Algorithm 1. The advantage of an MCMC algorithm is that it does not need to solve a convex optimization program in any of its steps, although evaluating the Jacobian term in f A could be time-consuming. Another potential application is conditional sampling from [β * , S * | β * ∈ B], which is useful in post-selection inference. For example, conditioning on the model selected by β, i.e. G(β * ) = G(β), we may wish to sample from an estimator b * with a nice asymptotic distribution for inference. For this problem, the bootstrap may be impractical since the conditioning event is often a rare event. However, from the joint density one can easily obtain the conditional density, proportional to f G (r G , s) with G = G(β), and implement an MCMC algorithm to draw from this conditional distribution. In the case of the lasso, Zhou [29] implemented a Metropolis-Hastings sampler for such conditional sampling. The more general case of the block lasso will be considered in future work.
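To illustrate the conditional-sampling idea in its simplest form, here is a generic random-walk Metropolis-Hastings sketch targeting a density known up to a constant and restricted to a conditioning event. This is a toy stand-in for [β * , S * | β * ∈ B]; the target below is one-dimensional and everything is illustrative:

```python
import numpy as np

def mh_conditional(log_pdf, in_event, x0, n_iter=10_000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings for a density restricted to an event:
    proposals outside {x : in_event(x)} are rejected, so every state of the
    chain satisfies the conditioning -- no draws are wasted on a rare event."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    lp = log_pdf(x)
    samples = []
    for _ in range(n_iter):
        prop = x + step * rng.standard_normal(x.shape)
        if in_event(prop):
            lp_prop = log_pdf(prop)
            if np.log(rng.uniform()) < lp_prop - lp:
                x, lp = prop, lp_prop
        samples.append(x.copy())
    return np.array(samples)

# Toy usage: a standard normal conditioned on the event {x > 1}
chain = mh_conditional(lambda x: -0.5 * float(np.sum(x ** 2)),
                       lambda x: bool(np.all(x > 1.0)),
                       x0=np.array([1.5]), n_iter=20_000,
                       rng=np.random.default_rng(3))
```

The same accept/reject structure applies with the conditional density ∝ f G (r G , s) as the target, except that evaluating f G involves the Jacobian term.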
Under a Gaussian error assumption, it is common practice to plug an estimated variance σ̂ 2 into the bootstrap P B(β, σ̂ 2 , λ). As long as σ̂ 2 is consistent at a certain rate, inference will be valid asymptotically [5,30]. Therefore, we can use our importance sampling algorithms with f A (r A , s; β, σ̂ 2 , λ) as the target density. Note that the density f A (3.13) depends on the error distribution only through the density g n of ε/ √ n. Under a general i.i.d. error assumption, estimating g n reduces to estimating a univariate density, which can be done quite accurately even for moderate n by either a parametric or a nonparametric method. Given an estimate ĝ n , our target density is readily obtained with g n replaced by ĝ n . An appealing alternative is to de-bias a scaled block lasso, which estimates σ 2 in a coherent way, for inference as in [12]. Estimator augmentation can be applied to derive the joint density of an augmented scaled block lasso, including its variance estimator σ̂ 2 , as outlined in Section 5.2. Given the density, one can follow the same importance sampling algorithms for tail probability approximation.
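A nonparametric route to ĝ n can be sketched as follows (illustrative, with our own helper names): estimate the univariate error density from residuals with a Gaussian kernel, then use the i.i.d. factorization — the density of ε/√n at an n-vector u is ∏ i √n f(√n u i ):

```python
import numpy as np

def kde_log_pdf(x, data, bandwidth):
    """Log of a Gaussian kernel density estimate for univariate data,
    evaluated at the points in x."""
    z = (np.asarray(x)[:, None] - np.asarray(data)[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return np.log(kernels.mean(axis=1) / bandwidth)

def log_g_n_hat(u, residuals, bandwidth):
    """Estimated log density of eps/sqrt(n) at the n-vector u under i.i.d.
    errors: log g_n(u) = (n/2) log n + sum_i log f_hat(sqrt(n) * u_i)."""
    u = np.asarray(u, dtype=float)
    n = u.size
    return 0.5 * n * np.log(n) + np.sum(kde_log_pdf(np.sqrt(n) * u, residuals, bandwidth))
```

In practice the residuals would come from a pilot fit, and the bandwidth from a standard rule of thumb; the point is only that the n-dimensional density ĝ n is assembled from a one-dimensional estimation problem.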

Generalizations
We generalize estimator augmentation to the block lasso with α = ∞ and to a scaled block lasso. In both cases, the subgradient has more structure.

Block-(1, ∞) norm
In this subsection, we consider the case α = ∞ (α * = 1). The difference between this case and the case α < ∞ comes from the subgradient vector S. Let B j = argmax k∈Gj |β k | ⊂ G j , which may contain multiple elements when a tie occurs, and B c j = G j \ B j for j ∈ N J . It follows from Lemma A.3 that (i) for β (j) ≠ 0, ‖S (j) ‖ 1 = 1 and S (j) has the form (5.1), where Σ k∈Bj t k = 1 and t k ≥ 0; (ii) ‖S (j) ‖ 1 ≤ 1 for β (j) = 0. Compared to (2.2), the discreteness of {S k = 0} for some k as in (5.1) distinguishes the (1, ∞) norm from the cases with α < ∞. Accordingly, the augmented estimator (β, S) has more structure. Recall that the active blocks of β are denoted by A = G(β). For j ∈ A, define K j and K c j as in (5.2). Put K = ∪{K j : j ∈ A} and K c = ∪{K c j : j ∈ A}. It follows from (5.1) that β Kj = γ j sgn(S Kj ) for j ∈ A, where γ j = ‖β (j) ‖ ∞ . We can then represent (β, S) subject to the constraints ‖β K c j ‖ ∞ ≤ γ j for j ∈ A. For A = A ∈ A (3.3), Proposition A.6 implies that, assuming solution uniqueness, the range of K is a collection K (A) of index sets, with the associated manifold M A,K built from M A in (3.1) with α * = 1. The sample space for (γ j , β K c j ) is a cone C j , and the sample space for (γ A , β K c , S) is the product Ω A,K = (∏ j∈A C j ) × M A,K . Taking the union over A ∈ A and K ∈ K (A) determines the space Ω for the augmented estimator (5.3). Compared to the case α < ∞, the subgradient S has lost |K c | free dimensions due to the constraints that S k = 0 for all k ∈ K c . Consequently, for every interior point s ∈ M A,K , there is a neighborhood that may be parameterized by s F with |F | = n − |A| − |K c | =: q. Note that ds k = 0 for each k ∈ K c . Let I = N p \ K c . Similar to Lemma 3.2, we can then find a matrix T ∈ R (p−|K c |)×q such that ds I = T ds F . For notational brevity, we will use (β, S) and its equivalent representation interchangeably.

Theorem 5.1. Fix p ≥ n, β 0 ∈ R p and λ > 0. Suppose Assumption 1 holds and that the program (1.2) for α = ∞ has a unique solution for almost all y ∈ R n . Let g n be the density of (ε/ √ n).
Then the distribution of the augmented estimator (β, S) is given by the n-form (5.5). A sufficient condition for solution uniqueness in this case is discussed in Appendix A.4. The density f A,K is defined in terms of the parameterization θ. Suppose that Γ ⊂ R |A|+|K c | is a subset of the product cone ∏ j∈A C j ; integrating f A,K over such sets interprets the differential form (5.5). The density here differs from that in (3.12) only in the Jacobian term. Clearly, the same importance sampling method (Algorithm 2) can be used here due to the cancellation of the Jacobian.

A scaled block lasso
As another generalization, we consider estimator augmentation for a scaled block lasso, which is scale-invariant and provides an estimate of σ 2 simultaneously. Mitra and Zhang [12] have developed inference methods via de-biasing a scaled group lasso, which is related to the square-root group lasso [2]. Following their formulation, we define a scaled block lasso (β, σ) as the minimizer of (5.6). Since the loss function in (5.6) is convex, (β, σ) is given by the KKT conditions (5.7)-(5.9). It is easy to see that (5.7) and (5.9) imply (5.8), and thus are sufficient and necessary for (β, σ) to be a scaled block lasso solution. This shows that the subgradient S here satisfies an additional equality constraint; therefore, for A = G(β) = A, its sample space is an (n − |A| − 1)-manifold obtained from M A defined in (3.1). Note that √ n(X T ) + z gives the coordinates of z ∈ row(X) with respect to the basis (X T / √ n); thus, the vector λW S for a scaled block lasso is normalized with respect to this basis. Let (γ A , S, A) be the equivalent representation of (β, S). Given A = A, the space for (γ A , S, σ) is still a manifold of dimension n. Suppose the program (5.6) has a unique minimizer for almost all y ∈ R n . Substituting Xβ 0 + ε for y, (5.7) defines a bijective mapping F of (r A , s, σ, A). Recall that (5.7) with S ∈ M A is equivalent to the KKT conditions, and thus this mapping is sufficient for determining the distribution of (β, S, σ). The restriction of F to a fixed A defines a one-to-one mapping F A : Ω A → R n , expressed in terms of H̃ defined in (3.10). As in Lemma 3.2, we parameterize a neighborhood of s ∈ M A by s F , with |F | = n − |A| − 1, so that ds = T (s, A)ds F , where T = T (s, A) is a p × |F | matrix.
Following a derivation similar to the proof of Lemma 3.2 in Appendix B.1, we find the Jacobian J A (r A , s, σ; λ) of F A with respect to the parameterization θ = (r A , s F , σ) ∈ R n . The joint distribution of (β, S, σ) is then given by the n-form f A (r A , s, σ)dθ = g n (F A (r A , s, σ; β 0 , λ))|J A (r A , s, σ; λ)|dθ, (5.12) where g n is the density of (ε/ √ n). This result applies to all α ∈ [1, ∞), under the convention that p j = 1 for all j when α = 1, in which case (5.6) reduces to a scaled lasso [1,18].
An assumption for the above result is the uniqueness of (β, σ). Given σ > 0, denote by β(σλ) the restricted minimizer of (5.6), which is the same as the block lasso (1.2) with tuning parameter σλ. Therefore, under Assumptions 1 and 2, β(σλ) is unique for any σ > 0. As established in Lemma 2 of [12], the profile loss function L(β(σλ), σ) is convex and continuously differentiable in σ, and σ is given by any solution to equation (5.13). If this equation has a unique solution in (0, ∞), then (β, σ) is unique. We summarize this result in the following theorem.
We believe (5.13) indeed has a unique solution for almost all y ∈ R n under fairly weak, though perhaps technical, assumptions; see Appendix A.5 for a detailed discussion. To avoid this technical issue in practice, one may include an additive term aσ 2 in the loss function in (5.6), where a is a small positive constant. The uniqueness of σ is then an immediate consequence of the strong convexity of the modified profile loss, L(β(σλ), σ) + aσ 2 .

Concluding remarks
By augmenting the sample space to that of (β, S), we have derived a closed-form density for the sampling distribution of the augmented block lasso estimator. Given the density, we have demonstrated the use of importance sampling in group inference, which can be orders of magnitude more efficient than the corresponding parametric bootstrap. For high-dimensional data, sparsity seems an essential assumption for inference, and consequently, an inference method is often built upon a non-regular penalized estimator. It is unlikely that an exact pivot can be worked out in this setting, and thus simulation-based approaches have been widely used. Our work on estimator augmentation opens the door to a large class of Monte Carlo methods for such simulations, which in our view is the main intellectual contribution. Due to the complexity of the sample space of an augmented estimator, the development of efficient Monte Carlo algorithms is a demanding task and an interesting future direction.
Since ‖x‖ 2 is strictly convex in x, for any c ∈ (0, 1), ‖y − X(cβ (1) + (1 − c)β (2) )‖ 2 < c‖y − Xβ (1) ‖ 2 + (1 − c)‖y − Xβ (2) ‖ 2 , by the hypothesis that Xβ (1) ≠ Xβ (2) . Therefore, L(cβ (1) + (1 − c)β (2) ) < cL(β (1) ) + (1 − c)L(β (2) ) = L * , which contradicts the assumption that the minimum of L is L * . The uniqueness of S is an immediate consequence of the uniqueness of Xβ and the KKT conditions (2.3).
We will first establish sufficient conditions for solution uniqueness for α < ∞, deferring the case α = ∞ to Appendix A.4. We start with more explicit expressions for Xβ and β. Write the KKT conditions for each block in (2.3) as (A.5). By (A.5) and (2.2), β (−E) = 0. Now the E block of (A.5) with β (−E) = 0 reads as (A.7), which shows that W (EE) S (E) ∈ row(X (E) ). Thus, W (EE) S (E) equals its projection onto row(X (E) ). Plugging this identity into (A.7), we arrive at an equation in β (E) with an explicit solution. Then, by the uniqueness of the fit Xβ (Lemma 2.1), all solutions β give the same value of X (E) β (E) . To make the relation between β (E) and S (E) more explicit, write Z j = X (j) η(S (j) ) for j ∈ E and collect these columns into Z; similarly, R m×B denotes the space of matrices with columns indexed by B. Then (A.10) can be rewritten as Σ j∈E γ j X (j) η(S (j) ) = Zγ E = ŷ.
Moreover, E, ŷ and Z are unique for any y, X and λ > 0.
The uniqueness of (ŷ, E, Z) is an immediate consequence of Lemma 2.1. So non-uniqueness can only come from γ E , when the linear system Zx = ŷ has multiple solutions for x, which happens only if null(Z) ≠ {0}. Therefore, every block lasso solution β satisfies the representation in (A.14), provided the associated condition holds.

A.3. Proof of sufficiency
If null(Z) = {0}, then β is uniquely given by (A.14) with γ = 0. In this case γ j = (Z + ŷ) j is necessarily nonnegative, as there always exists a solution to the block lasso problem. Furthermore, |E| ≤ n, and thus this solution has at most (n ∧ J) nonzero blocks. This leads to our first sufficient condition for the uniqueness of β. In the following, we prove Theorem 2.2 for α ∈ (1, ∞). Note that the case α = 1 is equivalent to the case α = 2 with p j = 1 for all j; thus, this part covers the range α ∈ [1, ∞) as in Theorem 2.2. The result for α = ∞ will be established in the next subsection.
Proof of Theorem 2.2. Suppose that null(Z) ≠ {0}. Then for some i ∈ E, there is a set A ⊂ E \ {i} with |A| ≤ n such that Z i = Σ j∈A c j Z j , where we may assume that Z j , j ∈ A, are linearly independent and c j ≠ 0 without loss of generality. Let r = y − Xβ denote the block lasso residual. By (A.5), for every j ∈ E, Z j T r/n = λ, where the last equality follows from Lemma A.1 since S (j) ∈ S α * . Therefore, for λ > 0 we have Σ j∈A c j = 1.
Note that Z j = X (j) η(S (j) ), S ∈ row(XW −1 ) and ‖S (j) ‖ α * = 1 for j ∈ E. The above equality thus contradicts the assumption that the columns of XW −1 are in blockwise general position (Assumption 2).

A.4. The case of α = ∞
Recall the KKT conditions in (A.5). Let α * = 1 in (A.6) to define E. By definition, β (j) = 0 for j ∉ E. For j ∈ E, define K j and K c j as in (5.2). Note that both E and K j are unique, due to the uniqueness of S and ŷ = Xβ for any y, X and λ > 0 (Lemma 2.1). It follows from (5.1) and (5.2) that β Kj = γ j sgn(S Kj ) for each j ∈ E. Then the fitted value ŷ can be expressed as ŷ = Zζ for a coefficient vector ζ. If null(Z) = {0}, then ζ, and hence β, will be unique, and Z has at most n columns. Now we generalize Proposition A.5 to the block-(1, ∞) norm regularization. Furthermore, |G(β)| ≤ |E| ≤ n ∧ J and |E| + Σ j∈E |K c j | ≤ n ∧ p.

A.5. Solution uniqueness of Equation (5.13)
Without loss of generality, let λ = 1. Due to the convexity of L(β(σ), σ) in σ, the solution set of (5.13) can always be written as an interval [σ 1 , σ 2 ], which reduces to a single point when σ 1 = σ 2 . Denote by [σ 1 (y), σ 2 (y)] the solution set for y. Suppose the solution to (5.13) is not unique for some y * ∈ R n , so that σ 2 (y * ) > σ 1 (y * ) > 0. Let us assume that the mapping y → (σ 1 (y), σ 2 (y)) ∈ R 2 is continuous at y * . Then, choosing a sufficiently small ball B(δ) centered at y * with radius δ > 0, we can find a set of positive measure on which the solution remains non-unique.

B.1. Proof of Lemma 3.2
Since dη(s (j) ) = D (jj) ds (j) for j ∈ A, we arrive at a linear relation among the differentials. Plugging ds = T ds F into this relation and letting θ = (r A , s F ) ∈ R n completes the proof. From (B.1) and (B.2), we have the constraints on ds given in (B.5); these linear equations can be used to find the matrix T (s, A) explicitly.

B.2. Proof of Theorem 3.3
Define H̃ A as in (3.10), where M and θ = (r A , s F ) are as in (3.8).

To establish the n-form in (3.12), it is sufficient to show (3.14) for u = s F . The bijective nature of H̃ (3.10) under Assumption 2 implies the required identification of events. Applying a change of variable in differential forms, we arrive at (3.12), where the Jacobian is determined by (B.6) and f A is defined in (3.13). This completes the proof.
Putting this together with the identity (X T ) + Ψ = X/n yields the stated expressions, and then (3.15) follows immediately.

B.4. Proof of Corollary 3.5
It is easy to see that Assumptions 1 and 2 hold trivially if rank(X) = p < n. Thus, by Lemma 3.1 the mapping H is bijective. In this case, row(X) = R p and V ⊥ = {0}, which imply that the constraint (B.4) no longer exists. Therefore, the matrix M (r A , s, A; λ) is p × p. Now, the desired result is established by the same arguments as in Appendix B.2, with U = X T ε/n in place of (ε/ √ n) and H A in place of H̃ A .

B.5. Proof of Lemma 4.1
First, the matrix Q that defines the constraint (B.1) does not depend on s. Second, the sphere S pj −1 α * , j ∈ A, can be parameterized by s (j)\k for almost every point on the sphere, where k ∈ G j is chosen as the last component in the group. More specifically, we may parameterize the positive half of the sphere, and the negative half in a similar way, both using the variables indexed by G j \ {k}. Therefore, we can always choose F (s, A) = F (A) to parameterize almost every point in M A .

B.6. Proof of Theorem 5.1
Put R = Q T with the matrix Q as in (B.1), and let I = K ∪ G A c ⊂ N p . Any s ∈ M A,K must satisfy the equality constraints listed above.

C.1. Derivations in Example 1
Next, we derive these results and find P(A = A). A few pre-calculations are in order, after which the density f ∅ is obtained immediately.

C.2. Derivations in Example 2
This section is divided into three parts. Part 1: Derivation of (3.22). For v ∈ R p and j ∈ N J , define u = v j ∈ R p so that u k = v k for k ∈ G j and u k = 0 otherwise. Let b ∈ R p denote the value for β, i.e. b (j) = r j s (j) for j ∈ N J . Straightforward algebra then leads to (3.22), where C(m) > 0 is a constant. A change of variable in v completes the proof.

C.3. Derivations in Example 3
We note that the constraint (B.5) reduces to ds j = 0 for j ∈ A, which implies that T A• = 0, as in property (ii). Recall that B = N p \ A. The constraint imposed on ds B comes from (B.4) and is thus independent of s, hence property (i). As a consequence, the set of free coordinates of s is always a subset of B, i.e. F = F (A) ⊂ B. Since |s j | = 1 for j ∈ A and p j = 1, the Jacobian simplifies accordingly; substituting this into (3.12) gives the density in (3.26). To compare (3.26) with Theorem 2 in [29], we apply the following change of variable. Let b j = s j r j denote the value of β j for j ∈ A. Plugging r j = |b j |, s j = sgn(b j ) and dr j = s j db j for j ∈ A into (3.26), we obtain the density for (β A , S B , A) parameterized by (b A , s F ), where we have again used |s j | = 1 for j ∈ A in the change of the volume elements.