Sparse regression with exact clustering

This paper studies a generic sparse regression problem with a customizable sparsity pattern matrix, motivated by, but not limited to, a supervised gene clustering problem in microarray data analysis. The clustered lasso method is proposed, with l1-type penalties imposed on both the coefficients and their pairwise differences. Somewhat surprisingly, it behaves differently from the lasso or the fused lasso: the exact clustering effect expected from the l1 penalization is rarely seen in applications. An asymptotic study is performed to investigate the power and limitations of the l1-penalty in sparse regression. We propose to combine data augmentation and weights to improve the l1 technique. To address the computational issues in high dimensions, we generalize a popular iterative algorithm both in practice and in theory and propose an 'annealing' algorithm applicable to generic sparse regressions (including the fused/clustered lasso). Some effective acceleration techniques are further investigated to boost the convergence. The accelerated annealing (AA) algorithm, involving only matrix multiplications and thresholdings, can handle a large design matrix as well as a large sparsity pattern matrix.


Background
This paper assumes a regression setup

y = Xβ + ε,  ε ∼ N(0, σ²I),  (1.1)

where y is the observed response vector and X is the regression (design) matrix of size n-by-p. The main goal is to recover β under some sparsity assumptions. One typical assumption is that β is sparse in the sense that many of its components are zero (referred to as zero-sparsity in this paper), in which case the lasso [26], obtained by solving the convex optimization problem

min_β (1/2)‖y − Xβ‖_2² + λ‖β‖_1,  or, stated in a multi-objective way [6],  min ( ‖y − Xβ‖_2², ‖β‖_1 ),  (1.2)

is a popular method. The test error is always the first concern in fitting a regression model, which is assumed throughout the paper. One advantage of the lasso lies in its computational feasibility even for large-scale data. For some concrete computation procedures, we refer to the LARS (Efron et al. [13]), the homotopy method (Osborne et al. [22]), and a recently re-discovered iterative algorithm (Fu [15], Daubechies et al. [10], Friedman et al. [14], Wu & Lange [29]), among others. There are numerous theoretical works on zero-sparsity; see Bunea et al. [8], Zhang & Huang [31], and Candès and Tao [9], to name a few.
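As a quick illustration of zero-sparsity recovery by (1.2), here is a minimal sketch using scikit-learn; the data are synthetic and not from the paper, and note that sklearn's Lasso minimizes (1/(2n))‖y − Xβ‖_2² + α‖β‖_1, so α corresponds to λ/n in (1.2).

```python
# Minimal illustration of zero-sparsity via the lasso (1.2).
# Hypothetical example, not from the paper; requires numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 3.0                      # only 5 nonzero coefficients
y = X @ beta_true + rng.standard_normal(n)

# sklearn's Lasso objective is (1/(2n))||y - Xb||_2^2 + alpha*||b||_1,
# so alpha plays the role of lambda/n in (1.2).
fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```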
On the other hand, motivated by the intuition that the l1-norm is a convex relaxation of the l0-norm, researchers have tried many more l1-type penalties to capture various types of sparsity, especially in the fields of signal processing and computer vision. Unfortunately, there is not much theoretical analysis in the literature, and current computational algorithms lack scalability in very high dimensions. This paper aims to shed light on a range of issues related to l1 sparsity recovery in a general setup.
The rest of the paper is organized as follows. Motivated by a gene clustering problem, Section 2 proposes a clustered lasso method and provides a generic sparse regression framework with customizable sparsity patterns. A theoretical study of the power and limitations of the l1-penalty is performed in Section 3, where improvements based on data augmentation and weights are also investigated. Section 4 tackles the computational problem in high-dimensional space by developing an iterative algorithm with theoretical justifications; it can be seen as a generalization of the popular coordinate descent algorithm. All technical details are left to the Appendices.

Clustered lasso
The motivation of this paper is a microarray study to discover aging-related genes. The microarray dataset consists of large-scale gene expression data of 133 human kidney samples. The gene expression matrix X is of size 133 × 44,928, and the responses, y, are the ages of the 133 subjects. After normalizing the data, one can run the lasso to classify the large number of genes into relevant and irrelevant factors in response to age. The number of relevant genes is limited by n. To study the gene effects more deeply and to possibly obtain more relevant variables in the model, a reasonable idea is to make the nonzero coefficients come out equal in clusters. As a form of regularization, this is much more interpretable (in terms of average expression values) than ridge regression. One can construct group-based predictors that are measured more accurately and are less sensitive to noise. Later, if some gene expression values are missing in the microarrays, these groups can be used to impute the missing values. Finally, the gene groups may provide biological insight by identifying the functionally related genes that are coexpressed in response to age. (See Dettling and Bühlmann [11] or Jörnsten and Yu [17] for a detailed description of this biological motivation.) In consideration of these benefits, we would like to identify and group relevant variables based on their effects (coefficients).
The proposed problem requires combined regression and clustering analysis. One possibility is to apply some clustering procedure directly to the estimated coefficients, which often results in an ad hoc algorithm. The estimate from the fitting stage may not be stable. To carry out the clustering task in the second stage, one needs to specify a similarity measure and the number of clusters. Typically, the standard error information of the estimate is discarded in this step. More importantly, the clustering criterion is different from the test error, so the obtained clusters may not improve model fitting at all. For high-dimensional data, this two-stage approach is unstable and inaccurate. Alternatively, a more ambitious and more trustworthy approach is to take the clusters into account when fitting the regression model, which can be achieved by integrating a penalty for improper clustering into the objective function. We refer to it as sparse regression with exact clustering. The notion of "exactness" is necessary for proper clustering because, without the standard error information, statisticians cannot determine how close two estimates, say β̂_i and β̂_j, really are, even if the gap between them is small; however, once a gap estimate is exactly equal to 0, one usually has enough confidence to put gene i and gene j into the same group. This exactness also enhances model parsimony (in comparison to ridge regression or the lasso): the number of degrees of freedom of the model is essentially the number of nonzero clusters.
In the language of multi-objective optimization [6], the problem can be formulated as

min ( ‖y − Xβ‖_2², ‖β‖_0, Σ_{i<j} 1_{β_i ≠ β_j} ).  (2.1)

Two types of sparsity are desirable: zero-sparsity and equi-sparsity, achieved by minimizing ‖β‖_0 and Σ_{i<j} 1_{β_i ≠ β_j}, respectively. The problem is a combinatorial optimization and is NP-hard [1]. Motivated by the fact that the l1-penalty is a convex approximation of the l0-penalty in optimization, we may instead minimize ( ‖y − Xβ‖_2², ‖β‖_1, Σ_{i<j} |β_i − β_j| ), or, equivalently, solve

min_β (1/2)‖y − Xβ‖_2² + λ_1 ‖β‖_1 + λ_2 Σ_{i<j} |β_i − β_j|,  (2.2)

referred to as the clustered lasso.
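For small p, the convex problem (2.2) can be handed to a general-purpose solver. Here is a minimal sketch using cvxpy on synthetic data (the data, the tuning values, and the helper matrix F are illustrative only; Section 4 discusses why this direct approach does not scale).

```python
# A small-scale sketch of the clustered lasso (2.2) with a generic convex solver.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.full(5, 2.0), np.full(5, -1.0), np.zeros(10)])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# F collects all pairwise differences beta_i - beta_j (i < j).
pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
F = np.zeros((len(pairs), p))
for r, (i, j) in enumerate(pairs):
    F[r, i], F[r, j] = 1.0, -1.0

lam1, lam2 = 1.0, 0.1
beta = cp.Variable(p)
obj = 0.5 * cp.sum_squares(y - X @ beta) \
      + lam1 * cp.norm1(beta) + lam2 * cp.norm1(F @ beta)
cp.Problem(cp.Minimize(obj)).solve()
print(np.round(beta.value, 3))
```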
Remark. Though similar in form, the clustered lasso (2.2) differs from the fused lasso (Tibshirani et al. [27]) in that it does not require the regression features to be ordered, so the clustering problem is much more challenging. In fact, the clustered lasso orders all features itself and thus can be used as a pre-processing step for the fused lasso; this idea is used later in the algorithm design of Section 3.3. It is also worthwhile to compare the clustered lasso with the grouped lasso methods [30, 32].
The grouped lasso assumes that the grouping of features (predictors) is known, arising naturally from the underlying background, such as the dummy variables introduced for a multi-level factor. The coefficients within the same predictor group are not necessarily equal. The clustered lasso performs supervised clustering and groups the predictors taking into account both X and y. Bondell and Reich's OSCAR [5] is close in spirit in this sense; it minimizes (1/2)‖y − Xβ‖_2² + λ_1 ‖β‖_1 + λ_2 Σ_{i<j} max(|β_i|, |β_j|). It is not difficult to see that this objective function can be written as (1/2)‖y − Xβ‖_2² + λ'_1 ‖β‖_1 + (λ_2/2) Σ_{i<j} ||β_i| − |β_j||, with λ'_1 = λ_1 + λ_2 (p − 1)/2, and thus OSCAR seeks zero-sparsity and equi-sparsity in |β|.
It is more convenient to introduce a general framework for sparse regression in which the objective is to obtain a regression estimate with T-sparsity, i.e., with Tβ sparse, where T is a sparsity pattern matrix specified by the user. A feasible alternative that avoids the NP-hardness is to solve

min_β (1/2)‖y − Xβ‖_2² + λ‖Tβ‖_1.

The matrix T can be used to encode sparsity in β well beyond zero-sparsity in the narrow sense. Some examples are presented as follows.
Example 2.1 (Mixed T). Suppose a priori knowledge of β is available: the successive differences of (β_1, β_2, β_3) are equal, β_3 equals β_4, and β_5 is zero. Then T may include the rows (1, −2, 1, 0, 0), (0, 0, 1, −1, 0), and (0, 0, 0, 0, 1) (written here for p = 5) to capture all sparsity in fitting the regression model.
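A small numpy sketch of this construction (illustrative, with p = 5 as in the example):

```python
# Building the mixed sparsity pattern matrix T of Example 2.1 (p = 5).
import numpy as np

p = 5
T = np.array([
    [1, -2, 1,  0, 0],   # beta_2 - beta_1 = beta_3 - beta_2
    [0,  0, 1, -1, 0],   # beta_3 = beta_4
    [0,  0, 0,  0, 1],   # beta_5 = 0
], dtype=float)

beta = np.array([1.0, 2.0, 3.0, 3.0, 0.0])  # satisfies all three constraints
print(T @ beta)                              # -> [0. 0. 0.]
```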
Example 2.2 (Clustered/Fused lasso). In our clustered lasso problem, the sparsity pattern matrix stacks the identity I (for zero-sparsity) with a matrix F including all pairwise differences (see (4.28)); this is (2.5). The fused lasso [27] replaces the F in (2.5) by a neighboring-difference matrix (see (4.26)).
Example 2.3 (Dense T ). T can be given by a wavelet transformation matrix (possibly overcomplete), which is useful in signal denoising and compression.
Example 2.4 (Spatial T). In the field of computer vision and image processing, there exist many meaningful choices for T, which can be constructed from various spatial operators, such as the Laplacian-of-Gaussian operator used in edge detection. The two-dimensional fused lasso [14] is another example.
In summary, the customizable T represents the sparsity requirement posed in the regression analysis. Unless otherwise specified, the studies in the rest of the paper apply to an arbitrarily given T matrix.

Limitations and improvements of the clustered lasso
Somewhat surprisingly, the plain clustered lasso (2.2) suffers from some serious problems. Its test error is often not small, even though it has two regularization parameters. More importantly, it rarely clusters the predictors properly in experiments. We demonstrate this with the following example.
This is a parsimonious model with only 4 degrees of freedom (4 nonzero clusters). The training sample size is 100. To make the clustered lasso less affected by the parameter tuning strategy, we generate a large enough validation dataset (of size 1000) to find the optimal λ_1 and λ_2. Ideally, Tβ has 27 zeros: 3 corresponding to the zero-sparsity and 24 corresponding to the equi-sparsity. But for the clustered lasso estimate β̂, T_z β̂ hardly shows any exact-clustering effect, as demonstrated by Figure 1. Although setting the regularization parameters in the objective function to be large results in more zeros in Tβ̂, the fitted model is then often poor in both estimation and prediction. The problem persists under other ad hoc parameter tunings, and increasing the sample size does not resolve the issue either. Our theorem below reveals that this bizarre behavior is in fact due to the l1 relaxation (see Theorem 3.1). In this section, we study the limitations of the l1 technique in T-sparsity recovery and propose some effective improvements.

Power and limitations of the l1-penalty in sparse regression
It is widely known that the l1-norm penalty is a convex approximation to the l0-norm penalty in optimization. For instance, the variable selection problem can be formulated as an l0-minimization, and discovering the best subset of predictors is NP-hard. The lasso replaces the l0-norm with the l1-norm in the criterion and offers a computationally feasible way to tackle this problem. However, it may not be selection consistent for coherent designs [33, 34]. For a general T, the nature of this l1 approximation deserves careful theoretical study.
For clarity, we adopt the generalized sign notation. Introduce Sgn(v) = {s : s_i = sgn(v_i) if v_i ≠ 0, s_i ∈ [−1, 1] if v_i = 0}, and let sgn(v) denote a specific element of Sgn(v); the usual sign vector corresponds to taking s_i = 0 whenever v_i = 0. Let β̂ be an optimal solution to the generic sparse regression

min_β (1/2)‖y − Xβ‖_2² + λ‖Tβ‖_1,  (3.1)

where T may not have full rank. By the (nonsmooth) KKT optimality conditions [25], β̂ is an optimal solution if and only if it satisfies

X^T(y − Xβ̂) = λ T^T sgn(Tβ̂)

for some sgn(Tβ̂) ∈ Sgn(Tβ̂). We work in a classical setting (C): assume that p and β are fixed and n → ∞, and that Σ ≜ X^T X / n → C, a positive definite matrix, with probability 1. Throughout this paper, given a matrix A, we use A_I to denote the submatrix of A composed of the rows indexed by I, so that A_I α = (Aα)_I for all α. Given two matrices A and B, B ⊂ A means that B is a submatrix of A composed of certain rows of A; hence A_I ⊂ A.
Consistency is a weak requirement, placing no restrictions on Σ or T . It can be easily achieved by a properly chosen λ. Yet in using the l 1 penalty, we expect something more in sparsity recovery. In this paper, we are more interested in another notion of consistency.
Zero s-consistency is a key notion used to characterize sparsity recovery. For example, in the clustered lasso, zero s-consistency means successfully discovering all the true groups asymptotically. Returning to our T-sparsity problem, define z = z(T, β) = {i : (Tβ)_i = 0} and nz = nz(T, β) = {i : (Tβ)_i ≠ 0}. Then we have the following result.
In the following studies, joint zero s-consistency will be our main concern. Namely, we study the conditions for zero s-consistency (with respect to some T_1 ⊂ T_z) under the consistency assumption. This is because, in practice, although blindly increasing λ would bring the thresholding power of the l1-penalty into play, we prefer a tuned value of λ with small test error (like the one obtained from cross-validation). The consistency requirement complies with the usual tuning criteria.
We use A^+ to denote the Moore-Penrose inverse of A. Then a necessary condition for β̂ to be zero s-consistent w.r.t. T_1 is given by (3.2), and a sufficient condition is given by (3.3). For a concrete example, suppose T_z has full row rank and substitute T_z for T_1 and T_nz for T_2; then (3.2) and (3.3) specialize accordingly, and the sufficient condition is quite strong. Simple algebra also shows that they further reduce to the irrepresentable conditions [33, 34] in the lasso case where T = I. As another interesting example, suppose (T_1, T_2) is 'separable' in the sense that T and C decompose into blocks aligned with (T_1, T_2), and assume that the block C_2 is nonsingular. Then the LHS of (3.3) is driven by the cross-covariance block C_12. Therefore, if the entries of C_12 are small enough (in absolute value), the zero s-consistency w.r.t. T_1 naturally follows. (Note that (T_11 S^{-1} T_11^T)^+ is a continuous function of C_12 since the rank of T_11 S^{-1} T_11^T is preserved.) This conclusion coincides with the lasso studies where T = I (see, e.g., Zhao and Yu [33]). Unfortunately, the clustered lasso does not fall into this class, because the rows of its T encompass all pairwise differences and thus never yield a separable (T_z, T_nz).
Theorem 3.1 indicates that, in contrast to consistency, zero s-consistency imposes further constraints on Σ (the data) aside from the controllable regularization parameter λ. Without going into the mathematical details, these conditions intuitively mean that one should have good control over (T_z C^{-1} T_z^T)^+ (T_z C^{-1} T_nz^T) to ensure the l1 penalty is effective for sparsity recovery. For instance, if we consider joint zero s-consistency with respect to T_1 for all signals satisfying T_1 β = 0, the sufficiency condition (3.3) reduces to a bound (3.6) on the matrix infinity norm of (T_1 C^{-1} T_1^T)^+ T_1 C^{-1} T_2^T. Hence the magnitude of the entries of (T_1 C^{-1} T_1^T)^+ T_1 C^{-1} T_2^T plays an important role. Given β, T_1, and C, (3.6) makes a huge difference between the fused lasso and the clustered lasso: the T_2 of the clustered lasso contains up to O(p²) more rows than the T_2 of the fused lasso. Recalling that the matrix infinity norm is the maximum of the l1-norms of the rows, we see that the clustered lasso is far more likely to violate (3.6).
We illustrate the conditions with the previous example in Figure 1 for both the clustered lasso and the fused lasso (the latter using the correct ordering from the true β). For convenience, we ignore the zero-sparsity constraint and concentrate on the equi-sparsity. Substituting T_z for T_1 in (3.2), the LHS equals 0.6 and the RHS equals 1 for the fused lasso, but these quantities are 3.0 and 1.6, respectively, for the clustered lasso. In (3.6), the LHS is only 1.7 for the fused lasso, but 31.2 for the clustered lasso. The fused lasso and the clustered lasso, though similar in form, thus show a remarkable difference in the behavior of the l1-penalty; for the latter it is much more difficult to recover the true sparsity, even asymptotically.
This explains the dilemma we encountered earlier. No matter how we devise a scheme to tune the regularization parameters, the tuning criterion favors models with small generalization error. Therefore, the tuned regularization parameters cannot be very large, as seen from Proposition 3.1 (if we do not want our estimate to be inconsistent). But Proposition 3.2 and Theorem 3.1 then limit the ability of the l1 penalty to enforce sparsity. Although this requirement on Σ might not be very restrictive for the lasso, or even for the fused lasso, it becomes so stringent for the clustered lasso that the expected exact-clustering effect is seldom strong enough to be observed in applications. In the next subsection, we propose several means of improving the naïve l1-penalty to attain exact clustering.

Weights
To further improve sparsity, weights can be added to the l1-norm. Zou [34] shows that, asymptotically, this weighted form of the lasso (the adaptive lasso) is sign consistent and enjoys the oracle properties. This technique applies to the generic sparse regression (3.1). According to Theorem 3.1, if we could rescale the rows of T in an ideal way, the LHSs of (3.2) and (3.3) could be reduced significantly while the RHSs remain unchanged. In fact, one advantage of the fused lasso (see Example 2.2) is that its two regularization parameters provide adaptive weights for the components of Tβ; that weight construction is based on the different types of sparsity. For a general T, however, it may not be possible to supply this information. Furthermore, it is really between the zero components (T_z β) and the nonzero components (T_nz β) that the weights should make a big difference. We introduce weights for each individual component of Tβ and consider the weighted sparse regression

min_β (1/2)‖y − Xβ‖_2² + λ Σ_i w_i |(Tβ)_i|,  (3.7)

where the w_i are positive.
Theorem 3.2. Assume the classical setup (C), and suppose the weights satisfy condition (3.8). Then, for some properly chosen λ(n), the optimal solution β̂ to (3.7) is both zero s-consistent with respect to T_z and √n-consistent.

(3.8) is a broad condition; essentially it only requires (3.9), and so it provides a flexible way of constructing weights. We can use 1/w_i = |(T β̂_wts)_i| with any consistent estimate β̂_wts. This can be viewed as a generalization of Zou [34]. (In fact, β̂_wts does not even have to be an estimator of β, as seen from (3.9), which also justifies the use of one-step SCAD weights [36].) On the other hand, one should be aware that Theorem 3.2 is an asymptotic study with p fixed. Therefore, although the weighted l1-penalty can increase model sparsity, careful consideration must be given to the practical weight construction, especially in large-p applications. Another issue is that adding the weights does not improve the test error very much; it can even hurt the goodness of fit to some extent. This is undesirable in statistical modeling, and we would like to combine it with the following data augmentation technique.
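Before turning to data augmentation, here is a minimal sketch of the weight construction 1/w_i = |(T β̂_wts)_i|; the small floor eps used to avoid division by zero is an implementation convenience, not part of the theory.

```python
# Adaptive weights for the weighted sparse regression (3.7):
# 1/w_i = |(T beta_wts)_i|, using any pilot estimate beta_wts.
import numpy as np

def adaptive_weights(T, beta_wts, eps=1e-8):
    """w_i = 1 / max(|(T beta_wts)_i|, eps); large weights push (T beta)_i to 0."""
    return 1.0 / np.maximum(np.abs(T @ beta_wts), eps)

# Example: pairwise-difference rows get huge weights when the pilot
# estimate already suggests the two coefficients are (nearly) equal.
T = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])
beta_wts = np.array([2.0, 2.01, -1.0])
print(adaptive_weights(T, beta_wts))
```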

Data augmentation
It has been recognized that the l1 penalty with a data-augmentation modification, such as the elastic net (eNet for short), can achieve a much smaller test error and can resolve singularities and the 'grouping' issue [35]. To introduce data augmentation (DA), we focus the discussion on zero-sparsity in this subsection; the technique carries over to a customizable T, as will be shown in Section 3.3.
If one cares about prediction accuracy only, ridge regression is a good alternative to the lasso. In the data-augmentation view, it solves an augmented least-squares problem with design and response given by appending the rows of √λ I below X and a block × below y, where the √λ I block decorrelates the predictor columns. The × part is often 0, but data-dependent choices might be better, such as using the univariate estimate: × = √λ β̂_uni. In fact, following this idea, we can give the elastic net a natural characterization and explain why an extra factor comes in to correct the naïve eNet [35]. In [35], the naïve eNet is introduced as a combination of the lasso and ridge regression by imposing both the l1 penalty and the l2 penalty on β. Then, to guard against double shrinkage, Zou and Hastie gave an empirical way to improve this naïve estimate by multiplying it by a factor of 1 + λ_2; the resulting eNet estimate is defined according to Theorem 2 of [35]. We show the following result.
In addition, since the eNet uses a search strategy of first picking a grid of values for λ_2 and then searching over the λ_1-space for each λ_2 in the grid, the tuned β̂_{λ*_1, λ*_2} based on (3.12) coincides with the tuned eNet estimate. In short, the eNet solves a lasso problem on augmented data of the form (3.12). This indicates the power of DA: even a not-so-accurate estimator (β̂_uni is not even consistent for nonorthogonal designs) can still effectively reduce the test error.
Next, we propose a nondiagonal form of data augmentation. To see the motivation, we ignore the l1-constraint for the moment and consider the augmented-data fit with a fixed target for β; we would rather replace the fixed scaling by an adaptive scale parameter. One way is to solve a problem jointly in β and a scale s, or equivalently (by optimizing over s) a problem in β alone. In general, given an initial estimate β̂_aug, we construct a matrix Λ(β̂_aug) and propose the nondiagonal data augmentation (3.16) built on it. Suppose β̂_aug ≠ 0. The eigenvalues of Λ(β̂_aug) are 1 with multiplicity p − 1 and 0 with multiplicity 1, and the eigenvector corresponding to 0 is β̂_aug / ‖β̂_aug‖_2; equivalently, Λ(β̂_aug) = I − β̂_aug β̂_aug^T / ‖β̂_aug‖_2². It is not difficult to show that the whole augmented input matrix is rank-deficient if and only if all components of Xβ̂_aug are exactly zero. Therefore, the nondiagonal DA resulting from this data-dependent scale parameter is still able to decorrelate the covariates in real-world applications.
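A small numpy sketch of this construction, under the reading above that Λ(β̂_aug) is the projection orthogonal to β̂_aug; the √τ scaling follows the DAW formulation of Section 3.3.

```python
# Nondiagonal data augmentation: project orthogonally to an initial estimate.
# Sketch based on the eigenstructure described in the text
# (eigenvalue 0 along beta_aug, eigenvalue 1 on its orthogonal complement).
import numpy as np

def nondiagonal_augmentation(X, y, beta_aug, tau):
    """Return the augmented design/response (X_aug, y_aug) used in (3.17)."""
    p = X.shape[1]
    u = beta_aug / np.linalg.norm(beta_aug)       # unit vector along beta_aug
    Lam = np.eye(p) - np.outer(u, u)              # I - u u^T, rank p-1
    X_aug = np.vstack([X, np.sqrt(tau) * Lam])    # append sqrt(tau)*Lambda rows
    y_aug = np.concatenate([y, np.zeros(p)])      # augmented responses are 0
    return X_aug, y_aug
```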
Interestingly, from an empirical Bayesian point of view, (3.16) corresponds to a multivariate Gaussian prior with a nondiagonal and degenerate covariance matrix (rank(Λ) = p − 1). Therefore, not all of the information provided by β̂_aug is used as prior knowledge. The new data augmentation is more robust and accommodates a less accurate initial estimate. Indeed, the tuning of the regularization parameter λ_2 makes it possible to save one degree of freedom in the construction of Λ.

Algorithm design for supervised exact-clustering
The data-augmentation and weights can be combined in sparse regression to reduce test error and increase model sparsity simultaneously. We discuss the practical algorithm design for the supervised exact clustering problem to demonstrate this point.
Given β̂_aug and β̂_wts, we perform the nondiagonal DA and introduce weights into the l1 penalty, which amounts to solving the following optimization problem:

min_β (1/2)‖y − Xβ‖_2² + (1/2)‖y_aug − X_aug β‖_2² + λ Σ_i w_i |(Tβ)_i|,  (3.17)

where λ and τ are two regularization parameters, X_aug = √τ Λ(β̂_aug), y_aug = 0, and w_i = 1/|(T β̂_wts)_i|. (3.17) will be referred to as the DAW version of the sparse regression. In particular, the DAW version of the clustered lasso will be called DAW-CLASSO for convenience.
For the supervised exact clustering problem, we propose to improve the plain clustered lasso estimate (denoted by β̂_c-lasso) as follows (Figure 2 outlines the procedure): (i) fit a fused lasso model with the covariates ordered according to β̂_c-lasso, the estimate denoted by β̂_f-lasso; (ii) substituting β̂_c-lasso for β̂_aug and β̂_f-lasso for β̂_wts, fit the DAW-CLASSO, the estimate denoted by β̂_daw-classo,1.
Then, we can repeat (i) with the covariates re-ordered according to the last estimate β̂_daw-classo,1 to obtain an updated fused lasso estimate, and then repeat (ii) with β̂_daw-classo,1 used for data augmentation and the updated fused lasso estimate used for weight construction. The new estimate is denoted by β̂_daw-classo,2.
Our experience shows that the improvement brought by β̂_daw-classo,2 is already significant, although ideally one can repeat the DAW process as time allows.
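The following schematic summarizes the pipeline. The solver functions (solve_clustered_lasso, solve_fused_lasso, solve_daw_classo) are hypothetical placeholders for whatever implementations are available (e.g., the AA algorithm of Section 4), and the ordering of covariates by the current estimate is one natural reading of step (i), not a detail fixed by the paper.

```python
# Outline of the DAW procedure of Section 3.3 (schematic only).
# The three solver callables are hypothetical stand-ins for actual solvers.
import numpy as np

def daw_clustered_lasso(X, y, solve_clustered_lasso, solve_fused_lasso,
                        solve_daw_classo, n_rounds=2):
    beta_aug = solve_clustered_lasso(X, y)            # plain clustered lasso fit
    for _ in range(n_rounds):
        order = np.argsort(beta_aug)                  # (i) order covariates by the
        beta_wts = solve_fused_lasso(X, y, order)     #     current estimate; fused lasso
                                                      #     returned in original indexing
        beta_aug = solve_daw_classo(X, y,             # (ii) DAW-CLASSO with data
                                    beta_aug=beta_aug,  #     augmentation and
                                    beta_wts=beta_wts)  #     fused-lasso weights
    return beta_aug
```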

Simulation studies
We carried out simulation experiments to empirically study the improvement brought by DAW. Each simulation dataset contains training data, validation data, and test data, with the numbers of observations denoted by # = "·/·/·", respectively, in the examples below. The rows of X are independently drawn from N(0, Σ). We use ({a_1}_{n_1}, ..., {a_k}_{n_k}) to denote a column vector made of n_1 copies of a_1, ..., n_k copies of a_k, consecutively.
Examples 3.1 and 3.2 demonstrate a situation with many small clusters in the coefficients, where overlap is likely to occur. The design matrix of the second is much more correlated than that of the first. Example 3.3 assumes a more challenging negatively correlated design, and the coefficients have big clusters as well as small clusters; the signal-to-noise variance ratio of this example is only about 1. Note that it is not the goal of this study to propose or advocate a best means of parameter tuning. We set aside a separate validation dataset to tune the parameters. This large-validation tuning ensures fair and stable performance comparisons. For methods with multiple regularization parameters, we use the alternating search strategy [24], which has been shown to be fast and efficacious.
Each model is simulated 50 times. We then measure the performance of each algorithm by the test error and the proper sparsity. The test error is evaluated on the test data. The proper sparsity is defined as 100% · |{i : (T_z β̂)_i = 0}| / |z|, the percentage of proper zeros in Tβ̂. It is a very sensitive measure for the clustering problem since it takes into account all pairwise comparisons within each cluster. Table 1 compares the performance of the clustered lasso (C-LASSO) and its DAW versions, DAW-CLASSO 1 and DAW-CLASSO 2 (obtained by performing the DAW process once and twice; see Section 3.3). The clustered lasso does not exhibit enough exact clustering even in the highly correlated Example 3.2. The inadequate proper sparsity indicates the great challenge of supervised clustering, especially with insufficient training data. A similar phenomenon is observed in the fused lasso studies (Tibshirani et al. [27]), where the ordering of the covariates is available.
The combined data-augmented weighted l1 technique significantly improves the performance of the clustered lasso in finite samples: the test error is reduced effectively, as a result of the (nondiagonal) data augmentation; simultaneously, the proper sparsity is enhanced, by at least 70%, after introducing weights into the l1-penalty. Both improvements are effective regardless of correlation strength and cluster size.
On the other hand, in spite of the encouraging simulation results, the computational difficulties cannot be underestimated. Interior-point-based semidefinite programming (SDP) solvers (such as SeDuMi or SDPT3) cannot even handle the clustered lasso for moderate values of p. Therefore, we are in great need of a fast algorithm with good scalability to apply the proposed methodology in high dimensions.

A fast algorithm for solving the sparse regression
In applying the (improved) clustered lasso to the microarray data, we encountered insurmountable difficulty with all optimization procedures available to date, mainly because T has p columns and O(p²) rows. Our experience shows that it is already extremely difficult or infeasible to carry out the supervised clustering for p just over 110. In this section, we propose a simple but scalable algorithm to solve the generic sparsity problem in practical applications. It is motivated by the popular coordinate descent algorithm for computing the lasso solution path.

Motivation
We start with the lasso, which minimizes

(1/2)‖y − Xβ‖_2² + λ‖β‖_1.  (4.1)

By the KKT optimality conditions [25], β̂ is an optimal solution if and only if it satisfies the equation

X^T(y − Xβ̂) = λ sgn(β̂),  (4.2)

where the same generalized sign notation is used to denote a subgradient of ‖β‖_1.
There is a simple but important fact about sgn: given an arbitrary sgn(β) ∈ Sgn(β), let ξ = β + λ sgn(β); then

β = Θ_S(ξ; λ),  (4.3)

where Θ_S(·; λ) (or Θ(·; λ) for simplicity) is the soft-thresholding operator using λ as the threshold value. Rewriting (4.2) as β̂ = Θ(β̂ + X^T(y − Xβ̂); λ) motivates an iterative design to solve (4.1):

β^(j+1) = Θ(β^(j) + X^T(y − Xβ^(j)); λ).  (4.4)

If ‖Σ‖_2 < 1, this nonlinear process can be shown to converge to an optimal point even if Σ is singular (in which case (4.4) is not a contraction but only a nonexpansive mapping) [10]. (4.4) has been proposed in different forms [10, 14, 29] and is strongly advocated for large-data problems. In particular, Daubechies et al. [10] proved nice theoretical results on its convergence in a functional framework, and Friedman et al. [14] demonstrated its remarkable performance in terms of computation time compared to the homotopy method and the LARS. This iterative algorithm is simple to implement and has very good scalability. Now we consider the generic sparsity problem

min_β (1/2)‖y − Xβ‖_2² + λ‖Tβ‖_1,  (4.5)

where T is a given sparsity pattern matrix that can be specified by users in different situations. In this section, we assume T has full column rank (and thus is a square or 'thin' matrix). The optimal β̂ satisfies the KKT equation (4.6): X^T(y − Xβ̂) = λ T^T sgn(Tβ̂). Similar to the derivation of (4.4), we would like to write Tβ̂ = Θ(Tβ̂ + λ sgn(Tβ̂); λ); the difficulty, however, is that T^T has no left inverse in the case of a 'thin' T, so the subgradient λ sgn(Tβ̂) cannot be recovered from (4.6). For example, in the fused lasso, T does not have any right inverse. Accordingly, it is difficult to develop a proper iterative algorithm based on (4.6). This poses a great challenge in generalizing the coordinate optimization from the lasso to the fused lasso or the clustered lasso. In [14], introducing descent cycles, fusion cycles, and smoothing cycles, Friedman et al. gave an ad hoc design for solving the diagonal fused lasso; there is no guarantee that the procedure converges or provides a solution to the original problem.
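Before moving to a general T, here is a minimal numpy implementation of the lasso iteration (4.4). It is a sketch: (4.4) as written assumes ‖X^T X‖_2 < 1, so the code uses an explicit step size t with t‖X^T X‖_2 < 1, which is the same iteration applied to a rescaled problem.

```python
# Iterative soft-thresholding (4.4) for the lasso (4.1).
import numpy as np

def soft_threshold(v, lam):
    """Theta(v; lam): componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def lasso_ista(X, y, lam, n_iter=5000, tol=1e-8):
    # Step size t with t * ||X^T X||_2 < 1, so (4.4) applies without pre-normalizing X.
    t = 0.99 / np.linalg.norm(X, 2) ** 2
    Sigma, Xty = X.T @ X, X.T @ y
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_new = soft_threshold(beta + t * (Xty - Sigma @ beta), t * lam)
        converged = np.linalg.norm(beta_new - beta) < tol
        beta = beta_new
        if converged:
            break
    return beta
```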
We reparameterize (4.5) by introducing H satisfying HT = I. Assuming the SVD of T is T = U D V^T, we take H = V D^{-1} U^T throughout this section. With γ = Tβ (so that β = Hγ), the generic sparse regression (4.5) is equivalent to the constrained lasso problem (4.7): minimize (1/2)‖y − XHγ‖_2² + λ‖γ‖_1 subject to γ lying in the column space of T. This suggests a simple iterative way to solve (4.5), given by (4.8) and (4.9). We can prove that the sequence of iterates defined by (4.8) converges under some mild conditions; yet a further challenge is that it does not converge to the right solution.
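A small numpy sketch of the reparameterization: compute the thin SVD of T and form the left inverse H = V D^{-1} U^T (T is assumed to have full column rank, as in the text).

```python
# Left inverse H = V D^{-1} U^T of a full-column-rank sparsity matrix T.
import numpy as np

def left_inverse_from_svd(T):
    U, d, Vt = np.linalg.svd(T, full_matrices=False)   # thin SVD: T = U diag(d) V^T
    H = Vt.T @ np.diag(1.0 / d) @ U.T                  # H T = I when rank(T) = ncol(T)
    return H, U

# Quick check on the mixed T of Example 2.1 padded with an identity block:
T = np.vstack([np.eye(5), [[1, -2, 1, 0, 0]]])
H, U = left_inverse_from_svd(T)
print(np.allclose(H @ T, np.eye(5)))                   # -> True
```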

The 'annealing' algorithm
Observe that the original optimization problem (4.5) is also equivalent to a rescaled problem (4.10) for any positive k. Applying the same reparameterization to (4.10), we obtain a variant of (4.8) and (4.9), written as (4.12). For clarity, we express (4.12) as even and odd updates γ_e^(j)(k) and γ_o^(j)(k); this is (4.14).

Theorem 4.1. The following results hold for the sequence of iterates defined by (4.14).
1. Convergence. There exists a k_0 > 0 such that for any k > k_0, the iterates converge for any initial value in (4.14): as j → ∞, γ_e^(j)(k) → γ_e(k), γ_o^(j)(k) → γ_o(k), and β^(j)(k) → β(k).
2. Optimality. As k → ∞, every limit point of β(k) (or γ_e(k), γ_o(k)) is an optimal solution to (4.5) (or (4.7)).
3. Rate. Let ∆(k) = γ_e(k) − γ_o(k), and let f_opt be the optimal value in (4.7). Then bounds of order 1/k² hold for ‖∆(k)‖_2 and for the objective gap, as given in (4.15) and (4.16), where σ_max (σ_min) denotes the largest (smallest) singular value of the corresponding matrix.
See Appendix C for the details of the proof, which uses a generalization of Daubechies et al.'s convergence theorem [10]. In the following, we drop the subscript e from γ_e^(j)(k) and γ_e(k) for simplicity. We summarize more findings in the case that Σ is nonsingular: a geometric bound of the form (4.17) holds, where ρ_0 = λ⁺_min(H^T Σ H) is the smallest positive eigenvalue of H^T Σ H, together with the error bound (4.19). Moreover, sign consistency can be achieved at finite k. Specifically, for any k large enough, sgn(γ(k)) = sgn(γ_opt), where the index set z satisfies (γ_opt)_z = 0.
The number of inner iterations needed to reduce the error from ε_0 to δ at a fixed k is about log(δ/ε_0) / log(1 − ρ_0/k²) ≈ (k²/ρ_0) · log(ε_0/δ), which indicates that the number of iterations required at a given k is O(k²). On the other hand, from (4.15) or (4.19), the error is of order 1/k². In general, for a small value of k, β^(j)(k) converges fast but to an inaccurate solution, while when k gets larger, β^(j)(k) converges more slowly but to a more accurate point.
We can adopt an 'annealing' design (not simulated annealing) with k acting as an inverse temperature parameter: run (4.14) for some k until convergence, then use the estimate as the initial point for a new iteration associated with a larger k. The outline of the annealing algorithm is given below; the details of the design are given in the next subsection.

• Given the current k, update γ_e^(j), γ_o^(j), and β^(j) according to (4.14) and (4.13).
• If the chain at the current k has (approximately) converged:
  - If the overall change is small enough, exit.
  - Otherwise, increase k to a larger value.
• Let j ← j + 1; go to the next iteration.
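To make the control flow concrete, here is a schematic Python skeleton of this loop. The inner update is passed in as a callable because the paper's exact even/odd update (4.14) is not reproduced here; only the run-at-fixed-k, then-increase-k structure is illustrated, with k doubled as in the implementation described below.

```python
# Schematic annealing loop: iterate at fixed k until the chain stabilizes,
# then increase k and continue from the current iterate.
# `inner_update(gamma, k)` is a placeholder for the update (4.14).
import numpy as np

def annealing(gamma0, inner_update, k0=1.0, k_max=1e4,
              tol_inner=1e-6, tol_outer=1e-8, max_iter=100000):
    gamma, k = gamma0, k0
    for _ in range(max_iter):
        gamma_new = inner_update(gamma, k)
        step = np.linalg.norm(gamma_new - gamma)
        if step < tol_inner:                    # chain at this k has stabilized
            if step < tol_outer or k >= k_max:
                return gamma_new                # overall change small enough: exit
            k *= 2.0                            # otherwise increase k (double it)
        gamma = gamma_new
    return gamma
```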
Both the inner j-convergence and the outer k-convergence can be slow, and the convergence rates need not be geometric (see, e.g., (4.19)) because the underlying operators are only nonexpansive. Some effective techniques are needed to boost the convergence.

Accelerated annealing
It is natural to think of updating k at each iteration j. In this inhomogeneous updating, the 'cooling schedule', i.e., the manner in which k(j) grows, is crucial for guaranteeing convergence to an optimal point that solves (4.7).
Theorem 4.2. If the cooling schedule k(j) grows slowly enough that condition (4.20) is satisfied, then the inhomogeneous chain must converge to the optimal solution.
We can take k(j) = √j, for instance. A detailed proof of Theorem 4.2 is provided in Appendix C, based on a useful decomposition for inhomogeneous chains due to Winkler [28]. In theory, a valid cooling schedule should grow no faster than a k(j) satisfying (4.20). Theorem 4.2 also implies that it essentially takes polynomial time to yield a good solution, in contrast to the exponential time of simulated annealing [20]. But √j might still be too slow in practice. In most applications, we are only interested in obtaining a good enough solution, thereby allowing an even faster cooling schedule. Empirically, we recommend stagewise homogeneous updating: run a sequence of homogeneous chains, each at a fixed k. The trick is to run the chains for small values of k first to complete the major improvements over the initial point, but not until convergence, since γ(k) may not be close to γ_opt; the fine adjustments are left to the large-k iterations. An illustration of the cooling schedule is given in Figure 3. In implementation, k is doubled once a certain stopping criterion is met; we find a relative-error criterion based on γ_e^(j)(k) − γ_o^(j)(k) works well, and, due to (4.15), we may also use k² · ‖β^(j+1) − β^(j)‖.

Accelerating the inner j-convergence is even trickier, because the iteration here is nonlinear and nonsmooth. Introduce the operator K built from U_⊥, the orthogonal complement of U (see Appendix C), and let α = H^T X^T y / k². We can then represent the updating kernel (4.14) in fixed-point form (see Appendix C) and consider two forms of relaxation, parameterized by ω, for this nonlinear process: Relaxation (I) and Relaxation (II). Both relaxations appear to converge in practice and yield an optimal solution when 0 < ω < 2; when ω = 1, they degenerate to the non-relaxed form (4.22). Before proceeding, we introduce some additional operators T_k and Θ_k (defined in Appendix C).

Proposition 4.2. For Relaxation (II), given any γ^(0), γ^(j)(k) converges to a fixed point of T_k as j → ∞, provided 0 < ω < 2. All conclusions in Theorem 4.1 hold under this condition, with the last statement modified accordingly. The cooling-schedule theorem (Theorem 4.2) also applies for 0 < ω < 2.
Convergence analysis is more difficult for Relaxation (I). Currently, we have obtained the following result. Proposition 4.3. For Relaxation (I), given any γ^(0), γ^(j)(k) converges to a fixed point of T_k as j → ∞, provided 0 < ω ≤ 1. If 2T_k − I is nonexpansive, the same conclusion holds for 1 < ω < 2.
The proof presented in Appendix C is motivated by Browder and Petryshyn's 'reasonable wanderer' [7]. Based on extensive experience, we claim that γ^(j) converges for 0 < ω ≤ 2. Relaxation (I) with ω = 2 is of particular interest: in this situation, ξ^(j) does not converge (it possesses two accumulation points), but γ^(j) converges and the limit depends on U_⊥ U_⊥^T ξ^(0); if U_⊥ U_⊥^T ξ^(0) = 0, this limit is an optimal solution, i.e., a fixed point of T_k. For an inaccurate initial point, the relaxation with ω = 2 can reduce the number of iterations by roughly 40% in comparison to ω = 1.
We now state the full procedure of the accelerated annealing (AA) algorithm. Suppose X, y, λ, T (= U D V^T), and H are known. In the initialization stage, we set a starting value β^(cur) and construct γ_e; the initial k is given by (4.25). The iteration (starting with j = 0) is specified below, with ε_outer, ε_inner,a, ε_inner,b as prescribed error tolerances.
This AA algorithm is very simple to implement, and can solve the sparse regression for any T . There are no complicated operations such as matrix inversion involved in the iteration. In addition, since the problem is convex, a pathwise algorithm with warm starts is preferred, where the estimate associated with the current value of λ is used as the initial point in AA for the next value of λ. An even more effective trick is to construct the initial estimate via the linear extrapolation of the last two estimates.
The computational cost of the AA algorithm is primarily due to matrix multiplications and thresholdings. Although an SVD of T is required, it is a one-time calculation. Furthermore, for regularly patterned sparsity matrices, such as those of the fused lasso and the clustered lasso, explicit analytical expressions are available. Let (4.26) be the sparsity matrix for the fused lasso and let (4.28) denote the sparsity matrix for the clustered lasso, where F_2 is the matrix of all pairwise differences. Proposition 4.4. Explicit SVD formulas are available for the sparsity matrices of both the fused lasso and the clustered lasso. To apply Proposition 4.4 to the DAW-CLASSO (3.17), we need to generalize our algorithms and results to the weighted version of (4.5), in which the penalty is ‖Λ Tβ‖_1 with Λ = diag{λ_i}, λ_i > 0. This is straightforward: replacing the universal threshold λ by the componentwise thresholds λ_i, all conclusions and proofs carry over.
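Since the closed-form SVD expressions of Proposition 4.4 are not reproduced above, here is a small numpy sketch that at least constructs the two difference matrices and obtains the SVD numerically; the stacking of an identity block with the difference rows follows the reading of Example 2.2, and the paper's (4.26)/(4.28) may include λ-dependent scalings not shown here. For large p one would use the analytical formulas instead.

```python
# Difference matrices underlying the fused and clustered lasso penalties,
# with a numerical thin SVD standing in for the closed forms of Proposition 4.4.
import numpy as np

def neighboring_differences(p):
    """F1: (p-1) x p matrix with rows e_{i+1} - e_i (fused lasso)."""
    return np.eye(p, k=1)[:p - 1] - np.eye(p)[:p - 1]

def pairwise_differences(p):
    """F2: all rows e_i - e_j, i < j (clustered lasso)."""
    rows = [np.eye(p)[i] - np.eye(p)[j] for i in range(p) for j in range(i + 1, p)]
    return np.array(rows)

p = 6
T_fused = np.vstack([np.eye(p), neighboring_differences(p)])
T_clust = np.vstack([np.eye(p), pairwise_differences(p)])
U, d, Vt = np.linalg.svd(T_clust, full_matrices=False)
print(T_clust.shape, np.allclose(U @ np.diag(d) @ Vt, T_clust))
```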

Practical performance
As previously mentioned, the task of supervised clustering is computationally quite challenging. Since the clustered lasso is a convex optimization problem, general-purpose solvers such as SeDuMi and SDPT3 can be applied. These solvers are usually based on interior point (IP) methods. Empirically, we find that they are more appropriate for small-scale problems and are more accurate than the proposed accelerated annealing algorithm when feasible. The SDP solvers typically fail when p is much greater than 100, due to high computational complexity and memory requirements; the sparsity matrix T or its left inverse H can be huge (O(p³) entries) even for a medium value of p. For the kidney microarray data described in Section 2, although we can reduce the problem size by gene filtering (for example, FDR < 0.05 yields about 800 genes), we could only manage to run the clustered lasso with general-purpose convex optimization software for p less than 110. This seriously restricts the use of the clustered lasso in real-world applications. By contrast, the AA iterations only involve low-complexity operations like matrix-vector multiplications and componentwise thresholdings, which provides good scalability. Moreover, statisticians usually need to compute the whole solution path to tune the regularization parameters, so warm starts (or our extrapolated warm starts) are particularly effective in speeding up the computation (due to the convexity of the problem). This, however, does not apply to the SDP solvers, which compute the solution path as a series of independent optimization problems. Furthermore, as seen from Proposition 4.4, there is no need to compute or store H in AA. In fact, U_1 and V_1 are the only dense matrices needed in calculating all matrix-vector multiplications, and they are of order p × p; this reduces the storage needs to O(p²). We would also like to point out that the cooling schedule can be used to provide a speed-accuracy tradeoff: in a limited-time situation, one may use a faster cooling scheme to obtain a greedy solution, which, though not necessarily optimal, may serve the needs of some applications.
Finally, we present the results of the AA-implemented DAW-CLASSO on the kidney data. Supervised clustering was applied to the 800 most correlated genes after FDR filtering, and five-fold cross-validation was used to tune the parameters. Figure 4 demonstrates the results. The coefficient estimates are successfully clustered, and gene groups are formed directly thanks to the exact-clustering effect. Some of these groups might be difficult to find with a two-stage procedure (modeling → clustering; see Section 2) based only on the differences between the coefficient estimates. In addition, unlike many ad hoc clustering procedures, the supervised clustering optimizes the clusters during the model fitting process and automatically selects the number of clusters and the cluster sizes.

Summary
We studied a generic sparse regression problem with a customizable sparsity pattern matrix T, motivated by a supervised gene clustering problem. Interestingly, we found, both in practice and in theory, that the power of the l1-penalty to approximate the l0-penalty can be rather poor (even for large samples), for example when T_nz is large and (T_z, T_nz) is not 'separable' (see Theorem 3.1). This causes serious trouble for the clustered lasso in achieving exact clustering.
We further proposed using data augmentation and weights to reduce the test error and increase the model parsimony simultaneously. From an empirical Bayesian point of view, our nondiagonal DA amounts to a degenerate multivariate Gaussian prior, where one degree of freedom is saved in the covariance matrix construction to better accommodate a less accurate initial estimate. Regarding the weighting technique, Theorem 3.2 generalizes the asymptotic lasso results [34, 36] and provides a broad condition for weight construction. To the best of our knowledge, no nonasymptotic results are available even in the lasso setting. (For adaptive weights used in orthogonal models, we refer to Zou [34] and She [24] (which also gave a correction to [34]) for some oracle results.) Hence a finite-sample theoretical analysis is an important topic for future research. Another idea for achieving the same goal of joint accuracy and parsimony in finite samples is to use nonconvex penalization [24], although the computational cost can be even higher. On the other hand, we can show (proof omitted) that, substituting an appropriate thresholding operator (such as hard-thresholding or hybrid hard-ridge thresholding) for Θ, the accelerated annealing still applies to nonconvex penalized clustering models.
Finally, the scalable AA algorithm also raises some interesting open problems, such as the analysis of Relaxation (I) and the study of convergence rates. These problems are nontrivial in numerical analysis due to the nonexpansive nature and nonlinearity of the underlying operators.

Acknowledgements
The author is grateful to the anonymous referee and the associate editor for their careful comments and useful suggestions to improve the presentation of the paper. Most of this paper is based on [23], supported by NSF grant DMS-0604939. The author would like to thank Art Owen for his valuable guidance and for his help in revising an early version of this manuscript. The author also greatly appreciates valuable discussions with Bala Rajaratnam on the subject of this paper.


Appendices

For the optimization problem (3.1), by the KKT optimality conditions [25], β̂ is an optimal solution if and only if there exists a sgn(Tβ̂) such that (A.2) holds; the proof is obvious by noticing the form of the subdifferential of the l1 term.

• Proof of Proposition 3.2. Assume for the moment the stated condition on λ(n). We first develop a √n-consistency result similar to Knight and Fu [18], but in a general situation. In fact, from the KKT equation (A.2), δ̂ ≜ √n(β̂ − β) satisfies a random criterion f(δ); noticing that f(δ) ⇒ g(δ), the conclusion √n(β̂ − β) ⇒ δ_o follows by Geyer [16].
• Proof of Theorem 3.1. First, it is easy to derive an asymptotic result similar to Lemma A.1, in which δ_o is nonrandom, defined as the minimizer of the limiting criterion; we then write down the corresponding KKT equation for δ_o.
Recall that β̂ is an optimal solution if and only if (A.2) holds. Therefore, the optimality of β̂ can be restated in terms of δ_o; condition (ii) there is naturally satisfied because U_{z′} T_1 = 0. (A.7) is equivalent to sgn(T_1 β̂) = (T_1 Σ^{-1} T_1^T)^+ α + U_{z′} η for some η. It is important to point out that even the original KKT equation (A.2) does not resolve the ambiguity of η (since T_1^T U_{z′} η = 0 · η = 0). Hence a sufficient condition for T_1 β̂ = 0 is given by (A.8), and a necessary condition for Tβ̂ = 0 is obtained along the same lines. Now we study the asymptotics. Necessity: if β̂ is zero s-consistent w.r.t. T_1, then from (A.5), T_1 δ_o = 0, and so (n/λ) T_1 β̂ → 0 in probability. In addition, with probability tending to 1, the corresponding inequality holds within any ε > 0. Observe that sgn(T_2 β̂) is bounded, so there exists a subsequence indexed by n_k such that sgn(T_2 β̂_{n_k}) → s with probability 1; by Proposition 3.1, s ∈ Sgn(T_2 β). Thus the inequality holds with probability 1 for any ε > 0, and the necessary condition follows. Sufficiency: our goal is to show that the probability of missing the zero pattern tends to 0. If not, we can likewise find a subsequence indexed by n_k such that sgn(T_2 β̂_{n_k}) → s ∈ Sgn(T_2 β), (n/λ) δ′ → 0, (n/λ) T_1 β̂ → 0, and Σ_{n_k} → C with probability 1; the resulting limiting probability statement contradicts the assumption. • Proof of Theorem 3.2. Define W = diag{w_i}, W_z = diag{w_i}_{i ∈ z}, and W_nz = diag{w_i}_{i ∈ nz}. The weighted sparse regression (3.7) simply replaces the T in (3.1) by W T. Define δ̂ = a(n, λ)(β̂ − β) for some sequence a(n, λ). Similar to the derivation of Lemma A.1, δ̂ solves a corresponding criterion. Following the lines of [34], one can prove that if (i) lim a/√n exists (say equal to a_0), and (ii) and (iii), two rate conditions relating (n, λ, a) to the quantities A and B, hold, then the limiting criterion involves r ∼ N(0, σ²C). To guarantee that such an a(n, λ) exists, it is enough for the corresponding rates to be compatible; then β̂ is a(n, λ)-consistent for any a satisfying (i), (ii), and (iii).
On the other hand, substituting W_z T_z for T_1 and W_nz T_nz for T_2 in (A.8), we obtain a sufficient condition for T_z β̂ = 0. Clearly, by (A.10), (A.11), and (ii), this holds with probability tending to 1.
This gives another explanation of our approach from the penalty-function point of view, and we immediately obtain the corresponding facts (see, e.g., [3]). The KKT equation yields the fixed-point characterization used below. Theorem 3.1 of [10], which makes use of Opial's conditions [21] in studying nonexpansive operators, cannot be applied directly, because the 2-norm of the operator K satisfying (C.7) is exactly 1 (as follows from Fact 3)), whereas Theorem 3.1 of [10] requires ‖K‖_2 < 1. Therefore, we need a generalization of Daubechies et al.'s (weak) convergence result. In fact, we can generalize Theorem 3.1 (or Proposition 3.11, to be more specific) in [10], which may also validate the over-relaxation technique used to speed up convergence. In this part, we use notation compatible with [10], with mild changes. (Our thresholding operator Θ(·; λ) uses a threshold value of λ instead of λ/2.) Let K be the operator in (C.7) and set J ≜ I − K*K. The iterative process can then be represented through J, where eig(J) denotes any eigenvalue of J. Note that Φ^SUR is still strictly convex in f and Proposition 2.1 of [10] holds; in particular, for f_opt = arg min_f Φ^SUR(f; a) given a, the usual surrogate inequalities follow. It is then easy to see that Φ(f_n) is nonincreasing and the associated series converges. With these facts, it is not difficult to write out the full proof of the (weak) convergence of f_n for any K satisfying (C.7), by making the corresponding changes in Lemma 3.5 and Lemma 3.7 of [10]. The details are left to the reader.
The cooling schedule part of Proposition 4.2 is left to the proof of Theorem 4.2.
• Proof of Proposition 4.1. Represent the iteration of γ^(j) by the nonexpansive operators T_k, Θ_k, and their composition. To prove the finite-k sign consistency, observe that η_k ≜ U^T γ(k) and η_opt ≜ U^T γ_opt respectively solve two related minimization problems. Define the index sets z = {i : (γ_opt)_i = 0} and nz = {i : (γ_opt)_i ≠ 0}. Given any index set I, we use U_I to denote the submatrix of U composed of the corresponding rows, so that (Uα)_I = U_I α for all α. Fact 5) states that η_opt solves (C.16), because U η_opt = γ_opt. Clearly, sgn((U η_k)_nz) = sgn((γ_opt)_nz) for k large enough, since U η_k → γ_opt. We claim that (U η_k)_z = (γ_opt)_z = 0 is also true for any k large enough. Otherwise, noticing that γ_opt is finite dimensional, there must exist index sets nzz ⊂ z and zz = z \ nzz such that each component of (U η_{k_j})_nzz is nonzero and (U η_{k_j})_zz = 0 for some subsequence η_{k_j} with k_j → ∞ as j → ∞, which implies U_⊥ U_⊥^T γ(k_j) → 0. It follows that a further subsequence of η_{k_j} asymptotically solves a related problem for some sign vector b_nzz (each component ±1). Obviously, none of the rows of U_nzz lies in the (row) space spanned by the rows of U_zz. Excluding the case of degeneracy, that optimization problem does not have the same optimal solution η_opt as (C.16). Hence the finite-k sign consistency holds. We also know that, for k large, η_k solves (C.18); note that (C.18) is a simple quadratic programming (QP) problem.
Let rz ⊂ z be an index set such that U_rz has full row rank and rank(U_rz) = rank(U_z). Since η_k always exists, the optimization problem (C.18) can be simplified accordingly. Solving this QP, we obtain a closed-form expression for η_k; note that since U_rz has full row rank, (U_rz A^{-1} U_rz^T)^{-1} exists. Letting k → ∞, we get the convergence rate of γ(k): ‖γ(k) − γ_opt‖ = O(1/k²).
• Proof of Theorem 4.2. We prove the theorem for the general relaxed case in form (II), where 0 < ω < 2, with ω = 1 corresponding to the non-relaxed version; see (C.13). The operators introduced in (C.14) will be used for simplicity, except that Θ_k is redefined by Θ_k • v = Θ(v; ωλ/k²) for all v.
That is, ‖γ(k′) − γ(k)‖_2 ≤ I* + II* + III*.  (C.25)
It is easy to verify the corresponding componentwise bounds, where the relation is meant componentwise ('≤'). Therefore, using Fact 3), we obtain the desired estimate. Obviously, T_k is nonexpansive. We claim that the set of fixed points of T_k, denoted by F, is nonempty. In fact, let γ be a minimizer of the convex function Φ_k defined by (C.4). The KKT optimality condition gives γ = Θ_k • (J_k γ + α_k) = T_k • γ.