Thresholding-based Iterative Selection Procedures for Model Selection and Shrinkage

This paper discusses a class of thresholding-based iterative selection procedures (TISP) for model selection and shrinkage. The weaknesses of the convex $l_1$-constraint (or soft-thresholding) have long been recognized in the wavelets literature, and many different forms of nonconvex penalties have been designed to increase model sparsity and accuracy. But for a nonorthogonal regression matrix, it is difficult both to investigate the performance in theory and to solve the problem in computation. TISP provides a simple and efficient way to tackle this, so that we can borrow the rich results available for the orthogonal design to solve nonconvex penalized regression for a general design matrix. Our starting point is, however, thresholding rules rather than penalty functions. Indeed, there is a universal connection between them, but a drawback of the latter is its non-unique form, and our approach greatly facilitates both the computation and the analysis. In fact, we are able to establish a convergence theorem and explore theoretical properties of the selection and estimation via TISP nonasymptotically. More importantly, a novel Hybrid-TISP is proposed based on hard-thresholding and ridge-thresholding. It provides a fusion between the $l_0$-penalty and the $l_2$-penalty, and adaptively achieves the right balance between shrinkage and selection in statistical modeling. In practice, Hybrid-TISP shows superior performance in test-error and is parsimonious.


Introduction
The lasso [30] has attracted a great deal of attention recently because it provides an efficient and continuous way to perform variable selection, thereby achieving a stable sparse solution. Although in the orthonormal case it is well understood and has elegant theories [3,7,12], its shrinking and thresholding are not direct for a general regression matrix, and it suffers from problems in both selection and estimation [8,37,39]. There has been a large and rapidly growing body of literature on the lasso over the past few years. The efficient procedures proposed for solving the lasso include the well-known LARS (Efron et al. [13]), the homotopy method (Osborne et al. [26]), and a recently re-discovered iterative algorithm (Fu [17], Daubechies et al. [10], Friedman et al. [16], Wu & Lange [33]). As for the theoretical aspects of the lasso, we refer to Knight & Fu [23], Zhao & Yu [37], Donoho et al. [11], Bunea et al. [5], Zhang & Huang [35], etc. for asymptotic and nonasymptotic results. Various extensions and modifications of the lasso have also been proposed, such as the grouped lasso (Yuan & Lin [34]), the Dantzig selector (Candès and Tao [9]), the adaptive lasso (Zou [38]), and the relaxed lasso (Meinshausen [24]).
This paper aims to improve the naïve $l_1$-penalty, by using nonconvex penalties, to achieve an effective and efficient procedure for model selection and shrinkage. The rest of the paper is organized as follows. Section 2 provides a mechanism for borrowing the rich nonconvex penalties from the orthogonal design to solve the general problem. From the point of view of thresholding rules, Section 3 constructs the thresholding-based iterative selection procedures (TISP) for model selection and establishes the convergence theorem. Section 4 investigates the theoretical properties of the selection and the estimation via TISPs nonasymptotically. In Section 5, we carry out an empirical study of TISP design, which leads us to a novel Hybrid-TISP based on hard-thresholding and ridge-thresholding. It provides a fusion between the $l_0$-penalty and the $l_2$-penalty, and adaptively achieves the right balance between shrinkage and selection in statistical modeling. In practice, Hybrid-TISP shows superior performance in both test-error and sparsity. Section 6 gives a real data example. All technical details are left to the Appendices.

Motivation: From Orthogonal Designs to Non-orthogonal Designs
We consider the penalized regression problem
$\min_\beta f(\beta) = \frac{1}{2}\|X\beta - y\|_2^2 + P(\beta; \lambda)$, (2.1)
where $X = [x_1, x_2, \cdots, x_p]$ is the regression matrix, $y \in \mathbb{R}^n$ is the response vector, and $P(\beta; \lambda)$ represents the penalty with $\lambda$ as the regularization parameter. Here $p$ may be greater than $n$. In this paper, we assume $\beta$ is sparse, and use (2.1) for predictive learning. Although prediction error or accuracy is our first concern, we prefer to obtain a parsimonious model that is more interpretable in practice and is consistent with Occam's razor. Usually $P$ is assumed to be additive in the sense that $P(\beta; \lambda)$ is obtained from a univariate $P$: $P(\beta; \lambda) = \sum_i P(\beta_i; \lambda)$. This sparsity problem has wide applications in variable selection, functional data analysis, graphical modeling, compressed sensing, and so on.
If $P(\beta; \lambda) = \lambda\|\beta\|_1$, then (2.1) is the lasso [30], a basic and popular method in variable selection. However, although the $l_1$-norm provides the best convex approximation to the $l_0$-norm and is computationally efficient, the lasso cannot handle collinearity [39], may result in inconsistent selection (cf. the irrepresentable conditions [37]), and may introduce extra bias in estimation [25].
On the other hand, if we concentrate on orthogonal designs only, i.e., $X^T X = I$, as in wavelets, $l_1$ is far from the only choice. There are established theories and algorithms for various types of (nonconvex) penalties.
It is interesting to note that several seemingly different penalties can lead to the same estimator, obtained by hard-thresholding.
In this simplified setup, (a) the fitting part of the penalized regression (2.1) is separable, which means we only need to deal with the univariate case, provided $P$ is also separable (which is true in general); (b) even if $P$ is nonconvex, it still often results in a unique solution.
One of our main goals in this paper is to borrow these rich results in the orthogonal design to help us solve the general problem (2.1). We use the following mechanism to achieve this. Define
$g(\beta, \gamma) = \frac{1}{2}\|X\gamma - y\|_2^2 + P(\gamma; \lambda) + \frac{1}{2}\langle \beta - \gamma, (I - \Sigma)(\beta - \gamma)\rangle$. (2.2)
Here $\langle a, b\rangle = a^T b$ and $\Sigma = X^T X$. Given $\beta$, minimizing $g$ over $\gamma$ is equivalent to
$\arg\min_\gamma \frac{1}{2}\|\gamma - ((I - \Sigma)\beta + X^T y)\|_2^2 + P(\gamma; \lambda)$. (2.3)
In contrast to (2.1), this problem has an orthogonal design; as mentioned earlier, this is easier to handle both in computation and in theory. For example, we may adopt some nonconvex penalties and still obtain a unique solution for $\gamma$. Given $\gamma$, minimizing $g$ over $\beta$ is equivalent to
$\arg\min_\beta \frac{1}{2}\langle \beta - \gamma, (I - \Sigma)(\beta - \gamma)\rangle$. (2.4)
Taking its derivative with respect to $\beta$ gives $(I - \Sigma)(\beta - \gamma) = 0$, from which it follows that $\beta = \gamma$ if $\|\Sigma\|_2 < 1$. Note that (2.4) is then a convex optimization. Therefore, the optimal value of $g$ is always achieved at $\gamma = \beta$ if $X$ is scaled down properly. The connection to the original problem is now clear: it is easy to verify that $\min_\beta g(\beta, \beta)$ is equivalent to $\min_\beta f(\beta)$. The advantage of optimizing $g$ instead of $f$ is that, given $\beta$, the problem is orthogonal and separable in $\gamma$, so we can adopt far more flexible penalties in the algorithm design, including nonconvex ones.
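As a quick numerical check on the claim that the orthogonalized subproblem (2.3) is separable and solved coordinatewise by a thresholding rule, the sketch below (the toy values of `z` and `lam` are ours) minimizes the univariate $l_1$-penalized objective on a fine grid and compares the result with closed-form soft-thresholding:

```python
import numpy as np

# Each coordinate of the orthogonalized subproblem (2.3) has the form
#   min_g 0.5*(g - z)^2 + lam*|g|,
# whose minimizer is soft-thresholding. Verify this on a fine grid.
z, lam = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 600001)
objective = 0.5 * (grid - z) ** 2 + lam * np.abs(grid)
g_star = grid[np.argmin(objective)]

# Closed-form soft-thresholding answer: sign(z) * max(|z| - lam, 0)
soft_z = np.sign(z) * max(abs(z) - lam, 0.0)
```

The same grid check works for any univariate penalty, which is exactly why the orthogonalized form (2.3) is so convenient.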

Thresholding Rules and Penalties
As the title suggests, our starting point in this paper is thresholding rules rather than different forms of the penalty function. One direct reason is that different P 's may result in the same estimator and the same thresholding, say, in the situation of hard-thresholding [1,14]. Moreover, starting with thresholding functions facilitates the computation (as will be shown in the next subsection). Besides, there is also a universal connection between thresholding rules and penalty functions that we will investigate in this subsection. For convenience, we consider the univariate case only.
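For concreteness, the three thresholding rules used throughout the paper can be written in a few lines. This is a sketch (the function names are ours); `a=3.7` is the SCAD default from Example 2:

```python
import numpy as np

def soft(t, lam):
    # Soft-thresholding: minimizer of 0.5*(t - b)^2 + lam*|b|
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard(t, lam):
    # Hard-thresholding: keep t unchanged when |t| > lam, else 0
    return np.where(np.abs(t) > lam, t, 0.0)

def scad(t, lam, a=3.7):
    # SCAD-thresholding (Fan & Li): soft near lam, identity beyond a*lam,
    # linear interpolation in between
    at = np.abs(t)
    return np.where(at <= 2 * lam, soft(t, lam),
           np.where(at <= a * lam,
                    ((a - 1) * t - np.sign(t) * a * lam) / (a - 2), t))
```

Soft-thresholding shrinks every survivor by $\lambda$, hard-thresholding leaves survivors intact, and SCAD interpolates between the two.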
Note that (3.2) is not the only way to construct a penalty that leads to $\Theta$ in solving the optimization. For example, in the situation of hard-thresholding, in addition to the continuous penalty (3.3), discontinuous penalties (including the discrete $l_0$-penalty) are also valid choices [1,14]. In some sense, (3.3) may be considered a continuous version of the discrete $l_0$-penalty.

TISP and Its Convergence
Now we go back to the mechanism introduced in Section 2 for the penalized multivariate regression problem (2.1), with $P$ constructed from a given thresholding function $\Theta$. Solving (2.3) yields $\gamma = \Theta((I - \Sigma)\beta + X^T y; \lambda)$. Seen from (2.4), our iterates simplify to
$\beta^{(j+1)} = \Theta((I - \Sigma)\beta^{(j)} + X^T y; \lambda), \quad j = 0, 1, 2, \cdots.$ (3.5)
This iterative procedure is referred to as the Thresholding-based Iterative Selection Procedure (TISP). TISP provides a feasible way to tackle the original optimization (2.1). It is a simple procedure that does not involve any complicated operations like matrix inversion. There are rich examples of the procedure defined by (3.5). (a) Using soft-thresholding in (3.5), we immediately obtain the iterative algorithm (in vector form) for solving the lasso problem where $P(\beta; \lambda) = \lambda\|\beta\|_1$ [10]. In fact, the asynchronous updating of (3.5) leads exactly to the component-by-component iteration referred to as the coordinate descent algorithm (see Friedman et al. [16]). The corresponding pathwise algorithm has been considered the fastest for solving the lasso problem to date, especially when $p > n$. (b) If we substitute hard-thresholding for $\Theta$, then, seen from (3.4), TISP is an alternating optimization for solving the penalized regression with the hard-penalty, i.e., the $l_0$-penalized regression problem. (c) We can also replace hard-thresholding by the smoother SCAD-thresholding to reduce instability. (d) Finally, it is worth mentioning that TISP may also include the ridge penalty $P(\beta; \lambda) = \lambda\|\beta\|_2^2/2$, if we set $\Theta(t; \lambda) = t/(1 + \lambda)$, (3.6) thanks to the generic definition of a thresholding function.
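The iteration (3.5) is only a few lines of code. Below is a minimal sketch with soft-thresholding (the lasso case); the data-generating details are ours, and $X$ is scaled down so that $\|X\|_2 < 1$, which suffices for the convergence discussion that follows. At convergence, the iterate satisfies the fixed-point ($\Theta$-equation) property:

```python
import numpy as np

def soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def tisp(X, y, lam, theta=soft, n_iter=1000):
    # TISP iteration (3.5): beta <- Theta((I - X'X) beta + X'y; lam),
    # started from zero; no matrix inversion is involved.
    beta = np.zeros(X.shape[1])
    Sigma, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        beta = theta(beta - Sigma @ beta + Xty, lam)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
X /= 1.01 * np.linalg.norm(X, 2)          # ensure ||X||_2 < 1
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ beta_true + 0.01 * rng.standard_normal(50)

beta_hat = tisp(X, y, lam=0.05)
# A limit point satisfies beta = Theta((I - Sigma) beta + X'y; lam):
fp_gap = np.max(np.abs(beta_hat - soft(
    beta_hat - X.T @ (X @ beta_hat) + X.T @ y, 0.05)))
```

Swapping `theta` for a hard- or SCAD-thresholding function turns the same loop into the corresponding nonconvex penalized regression solver.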
Obviously, if $\Sigma$ is nonsingular (so necessarily $n \ge p$), the TISP mapping is a contraction and thus the sequence $\beta^{(j)}$ converges to a stationary point of (2.1). We would like to apply TISP to large-$p$ problems as well, where $\Sigma$ is singular. A surprising fact is, however, that TISP may not be a nonexpansive operator for most thresholdings (except soft-thresholding), let alone a contraction. The following studies cover the large-$p$ case ($p > n$). We use $\mu(A)$ to represent an arbitrary singular value of matrix $A$, and $\mu_{\max}(A)$ ($\mu_{\min}(A)$) the max (min) of $\mu(A)$, respectively.
Without loss of generality, suppose the penalty function defined by (3.2) satisfies the bounded curvature condition (BCC) for some symmetric matrix $H$, where $s = s(\beta; \lambda)$ is given by (3.1). Many thresholding rules of practical interest, including Examples 1-3, satisfy the BCC with a positive semi-definite $H$. For instance, for soft-thresholding, $H = 0$ since $\|\beta\|_1$ is convex; for hard-thresholding, $H = I$; for SCAD-thresholding, we can take $H = I/(a - 1)$ (recall that the parameter $a$ is assumed to be greater than 2 in Example 2, and so $H$ is positive definite).
Moreover, if $\mu_{\max}(\Sigma) < 1 \vee (2 - \mu_{\max}(H))$, there exists a constant $C > 0$, depending on $X$ and $H$ only, such that the objective decreases by at least $C\|\beta^{(j+1)} - \beta^{(j)}\|_2^2$ at each iteration. Therefore, for an arbitrary $X$, we can use TISP of the following form in practice:
$\beta^{(j+1)} = \Theta((I - \Sigma/k_0^2)\beta^{(j)} + X^T y/k_0^2; \lambda)$, (3.10)
where $k_0 \ge \mu_{\max}(X) = \|X\|_2$; larger values of $k_0$ generally lead to slower convergence. Applying Theorem 3.1 to some interesting special cases gives the following corollaries.
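The scaling in (3.10) amounts to replacing $X$ and $y$ by $X/k_0$ and $y/k_0$ with $k_0 = \|X\|_2$, which makes the linear part of the TISP map nonexpansive even when $p > n$. A sketch with hard-thresholding (the simulation parameters are ours):

```python
import numpy as np

def hard(t, lam):
    return np.where(np.abs(t) > lam, t, 0.0)

def scaled_tisp(X, y, lam, theta=hard, n_iter=300):
    # Scale by k0 = ||X||_2 so that the linear part of the TISP map,
    # I - (X/k0)'(X/k0), has spectral norm <= 1 for any design.
    k0 = np.linalg.norm(X, 2)
    Xs, ys = X / k0, y / k0
    S, b = Xs.T @ Xs, Xs.T @ ys
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = theta(beta - S @ beta + b, lam)
    return beta

rng = np.random.default_rng(1)
n, p = 40, 80                               # a large-p (p > n) setting
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 10.0, -10.0
y = X @ beta_true + 0.01 * rng.standard_normal(n)

beta_hat = scaled_tisp(X, y, lam=1.0)
Xs = X / np.linalg.norm(X, 2)
# Nonexpansiveness of the linear part after scaling:
linear_norm = np.linalg.norm(np.eye(p) - Xs.T @ Xs, 2)
```

Note that the zero pattern of the estimate is unaffected by the scaling; only the effective $\lambda$ changes.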
Corollary 3.1 generalizes the lasso result by Daubechies et al. [10], and coincides with our previous study [27]. Corollary 3.3 covers the orthogonal case, since SCAD assumes $a > 2$ and thus $2 - \frac{1}{a-1} > 1$. Finally, it is worth pointing out that TISP may not always be an MM algorithm [21] like the LLA method of Zou & Li [40]. Take SCAD-thresholding as an example: the surrogate may fail to majorize $f$, yet TISP converges. Theorem 3.1 implies that if $X$ is scaled down properly (which does not affect the variable selection), $f(\beta^{(j)})$ is nonincreasing throughout the iteration process.
We can easily show a result similar to that of Zou & Li [40]: given an initial point $\beta^{(0)}$, if $\beta^*$ is a limit point of the TISP sequence $\beta^{(j)}$, then $\beta^*$ is a stationary point of $f(\beta)$ in (2.1), or equivalently, a fixed point of (3.5).
Denote by $F$ the set of the fixed points of TISP. That is, any $\beta^* \in F$ satisfies the implicit equation
$\beta^* = \Theta((I - \Sigma)\beta^* + X^T y; \lambda)$, (3.11)
referred to as the $\Theta$-equation. Clearly, local minima of $f$ are solutions of (3.11). In the next section, we will perform a nonasymptotic study of the good properties of the points in $F$. Here, we give the following optimality result. Although it is well known that nonconvex penalties often result in a unique optimal solution in the orthogonal design, this proposition shows, perhaps surprisingly, that the same conclusion holds as long as $X$ is not too far from orthogonal (characterized in terms of $H$). For instance, for SCAD thresholding and its penalty, TISP necessarily leads to the global minimum of $f$, provided $1/\sqrt{a-1} \le \mu(X) \le \sqrt{2 - \frac{1}{a-1}}$, or $0.61 \le \mu(X) \le 1.27$ when $a = 3.7$ (the default choice in SCAD; see Example 2), given any initial point $\beta^{(0)}$. In summary, TISP is a successful algorithm for solving penalized regressions for a general design matrix.

Related Work
The main contribution of this paper is to consider a new class of $\Theta$-estimators defined by the $\Theta$-equation (3.11) for model selection and shrinkage; they can be naturally computed by TISP and are associated with penalized regressions. In particular, the penalty $P$ can be constructed via the three-step procedure introduced in Section 3.1. More generally, the $\lambda$ in (3.11) can be component-specific. For example, if $X$ is not column-normalized, we may use the component-specific thresholds $\boldsymbol{\lambda} = \lambda(\|x_1\|_2, \|x_2\|_2, \cdots, \|x_p\|_2)^T$, where $\lambda$ is a regularization parameter. With a carefully designed $\Theta$, we obtain a good estimator with both accuracy and sparsity, as will be shown in Section 5.2. The corresponding penalty is, not surprisingly, nonconvex, which indicates the difficulty of this NP-hard problem. Nonconvex penalties have been successfully used in real-world applications such as high-dimensional nonparametric modeling [3], survival analysis [6], and microarray data analysis [32,36], where they achieve outstanding performance. The numerical optimization has been a challenging and intriguing problem. In the context of wavelet denoising where $XX^T = I$, Antoniadis & Fan [3] proposed ROSE to approximately solve the minimization problem for a wide class of nonconvex penalties. They also introduced the graduated nonconvexity (GNC) algorithm, developed in image processing; it has a number of tuning parameters and is computationally intensive. Fan and Li [15] then proposed a generic local quadratic approximation (LQA) algorithm which solves a series of $l_2$-penalized problems. Like ridge regression, this approach does not intrinsically yield zeros, and setting a small cutoff value during iteration has been shown to be too greedy. A refined version is the perturbed LQA suggested by Hunter & Li [22] to avoid numerical instability. The perturbation parameter needs to be chosen very carefully in implementation since it affects the sparsity of the solution as well as the speed of convergence.
Recently, Zou & Li [40] proposed a new local linear approximation (LLA) which significantly improves the LQA. Explicit sparsity is attained by solving a weighted lasso problem at each iteration step. (Note that our TISP does a simple thresholding at each step.) A one-step SCAD estimator is advocated. Our empirical studies show that this one-step convex approximation has limited power in finite samples. Although the estimate is sparser than that of the plain $l_1$-penalty, it may result in misleading models with poor prediction error. See Section 5 for details.
Using thresholding rules to define $\Theta$-estimators shares similarities with the studies of $M$-estimators of $\psi$-type in robust regression. Most $M$-estimators were proposed in the form of $\psi$-functions rather than loss functions, such as Huber's, Hampel's three-part, and Tukey's bisquare $M$-estimators. Indeed, we find an interesting connection between these two fields. Assume a mean shift outlier model, $y = X\beta + \gamma + \epsilon$, $\epsilon \sim N(0, \sigma^2 I)$, where $n > p$ and $\gamma \in \mathbb{R}^n$ is sparse. If $\gamma_i$ is nonzero, case $i$ is an outlier. Let $H = X(X^TX)^{-1}X^T$ be the hat matrix and suppose its spectral decomposition is given by $H = UDU^T$. Define an index set $c = \{i : D_{ii} = 0\}$ and let $U_c$ be formed by taking the corresponding columns of $U$. Then a reduced model can be obtained from the mean shift outlier model: $U_c^T y = U_c^T \gamma + U_c^T \epsilon$, since $U_c^T X = 0$. After getting $\hat\gamma$ from TISP, we can estimate $\beta$ by $\hat\beta = (X^TX)^{-1}X^T(y - \hat\gamma)$.
Simple algebra shows that this special $\Theta$-TISP solves an $M$-estimation problem associated with $\psi$, if $(\Theta, \psi)$ satisfies $\Theta(t; \lambda) + \psi(t; \lambda) = t$. It is well known that Huber's method (or equivalently, Soft-TISP, which corresponds to using a convex $l_1$-penalty on $\gamma$) behaves poorly in outlier detection even for moderate leverage points. Instead, redescending $\psi$-functions are advocated, which correspond to using nonconvex penalties for the sparsity problem (3.13).
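The reduced model can be solved by the same iteration: since $U_cU_c^T = I - H$, the TISP update for $\gamma$ simplifies to $\gamma \leftarrow \Theta((I - H)y + H\gamma; \lambda)$. A sketch using hard-thresholding (the simulated data and the choice of $\lambda$ are ours):

```python
import numpy as np

def hard(t, lam):
    return np.where(np.abs(t) > lam, t, 0.0)

def outlier_tisp(X, y, lam, n_iter=200):
    # Mean shift model y = X beta + gamma + eps with sparse gamma.
    # TISP on the reduced model iterates
    #   gamma <- Theta((I - H) y + H gamma; lam),  H = X (X'X)^{-1} X'.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    r = y - H @ y                        # (I - H) y, computed once
    gamma = np.zeros(len(y))
    for _ in range(n_iter):
        gamma = hard(r + H @ gamma, lam)
    beta = np.linalg.solve(X.T @ X, X.T @ (y - gamma))
    return beta, gamma

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)
y[:5] += 6.0                             # five shifted cases (outliers)

beta_hat, gamma_hat = outlier_tisp(X, y, lam=2.5)
```

Nonzero entries of `gamma_hat` flag the outlying cases, and `beta_hat` is the OLS fit after the shifts have been removed.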

Selection and Estimation via TISP
TISP provides a very simple way to do variable selection via penalized regressions. In this section, we will perform a theoretical study of the variable selection and coefficient estimation by TISPs based on different thresholdings. Our results are nonasymptotic.

Sparsity Recovery
To study the sign-consistency of a TISP estimate, we denote by $p_s$ the probability of successful sign recovery, that is, the probability that there exists a $\hat\beta \in F$ such that $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta)$.
To simplify asymptotic discussions, we assume $X$ has been scaled to have all column $l_2$-norms equal to $\sqrt{n}$. Define $\Sigma^{(s)} = \Sigma/n$. To get a better form of the bounds for $p_s$, we define two quantities: $\mu = \mu_{\min}(\Sigma^{(s)}_{nz,nz})$ and a cross-correlation quantity $\kappa$. Intuitively, $\kappa$ measures the 'mean' correlations between the relevant predictors and the irrelevant predictors. The following nonasymptotic result always holds for the selection via TISP.
where ϕ is the standard normal density.
Clearly, the size of $\kappa$ is very important. A small value of $\kappa$ weakens the interference between $X_z$ and $X_{nz}$ and helps recover the sparsity correctly. We can also use this theorem to explore some asymptotics. (i) Assume $\beta$, $d_z$, and $d_{nz}$ are fixed and $n \to \infty$; then, under some regularity conditions, if $\tau/\sqrt{n} \to \infty$ and $\tau/n \to 0$, TISP is sign-consistent. This result in the Soft-TISP (lasso) case coincides with other studies like [23,37]. (ii) Suppose $\beta_{nz}$ and $d_{nz}$ are fixed, $n, d_z \to \infty$, and $\mu \ge (1 + \epsilon)\kappa d_{nz}$ for some $\epsilon > 0$. Then TISP can successfully recover the sparsity pattern of $\beta$ if $d_z \varphi(M)/M \to 0$ and $\tau/n \to 0$, which only requires $n$ to grow faster than $\log d_z$.
Unfortunately, the regularity condition $\mu \ge \kappa d_{nz}$ cannot be removed in general. In the lasso case, it is a version of the irrepresentable conditions [37]. (We took this more restrictive form because it is more intuitive and leads to cleaner bounds in (4.3) and (4.4).) However, for hard-thresholding-like $\Theta$'s, this condition is unnecessary and we can obtain stronger results.
We say that $\Theta$ belongs to the hard-thresholding family if, roughly, it applies no shrinkage to components whose magnitudes are well beyond the threshold. Under the conditions of Theorem 4.2, the required assumption is usually satisfied for both hard- and SCAD-thresholding. Therefore the TISP induced by a $\Theta$ in the hard-thresholding family can achieve better performance in variable selection. This will be verified empirically in the next section. Note that although in the orthogonal case hard-thresholding and soft-thresholding give exactly the same zeros, they result in very different sparsity patterns in our iterative procedure for a nonorthogonal $X$.

Estimation Risk
We obtain the following TISP risk bounds for any thresholding Θ.
where $M$ is defined as in Theorem 4.1, in which we assume $\kappa^2 \le \mu\nu/(d_z d_{nz})$ and $\mu \ge \kappa d_{nz}$. This general result holds for any TISP estimate. It is not difficult to show that in the previous setting of (i), where $\beta$ is fixed, we also obtain $R_z \to 0$ and $R_{nz} \to 0$ by the theorem if $\tau/\sqrt{n} \to \infty$ and $\tau/n \to 0$. Besides, under the conditions stated in the theorem, $M$ grows like $\sqrt{2\log d_z}$; the classical result [12] in the orthogonal design implies that this risk bound cannot be improved significantly in general. We address the TISP design problem in the next section via an empirical study.
In the orthogonal case, we can show the oracle inequalities [12] hold.
This nonasymptotic result covers soft-, hard-, and SCAD-thresholdings. It coincides with the classical soft-thresholding studies [12] and is sharper than [3,38]. (A correction of Zou's oracle bound [38] is also given at the end of the proof; see Appendix A.4.)

A numerical study of TISPs
In this section, we demonstrate the empirical performance of TISPs on some simulated data. Although there are rich choices of $\Theta$ in (3.5), we focus on three basic TISPs in this subsection. In addition to the Soft-TISP, i.e., the lasso, we implemented Hard-TISP and SCAD-TISP, whose thresholdings belong to the hard-thresholding family. The parameter $a$ in SCAD-thresholding takes the default value 3.7, based on a Bayesian argument [15]. As seen from the theoretical studies in Section 4, the last two should perform better than the lasso in variable selection. In generating the solution path for a grid of $\lambda$-values, we always set the initial point $\beta^{(0)}$ to be zero in Hard- or SCAD-TISP. A natural search range for $\lambda$, seen from (3.11), is $[0, \|X^T y\|_\infty]$, if $X$ has been column-normalized. (Note that a pathwise algorithm with warm start, which takes the previous estimate associated with the old value of $\lambda$ as the initial point of the procedure for the current value of $\lambda$, may be inappropriate for TISPs when nonconvex penalties are used. In fact, the solution path associated with a nonconvex penalty is generally not continuous in $\lambda$, and warm starts lead to bad solutions because of multiple-local-minima effects.) For comparison, the one-step LLA method, proposed by Zou & Li [40] for penalized likelihood models, is also included in our tests. They showed good asymptotics for one-step SCAD when $n \to \infty$ and $p$ is fixed, and demonstrated its performance in various numerical examples. The one-step LLA is actually a weighted lasso with weights constructed from the OLS estimate using different penalty functions. According to our general result on weights in sparse regression [27], it can achieve better sign consistency than the lasso as $n$ grows to infinity. We are greatly interested in drawing a comparison between TISP and LLA since TISP also directly solves the penalized regression problems.
We did experiments on two simulated datasets. Each dataset contains training data, validation data, and test data. We use # = "$\cdot/\cdot/\cdot$" to denote the numbers of observations in the training data, validation data, and test data. Let $\Sigma$ be the correlation matrix in generating $X$, i.e., each row of $X$ is independently drawn from $N(0, \Sigma)$. We use $(\{a_1\}_{n_1}, \cdots, \{a_k\}_{n_k})$ to denote the column vector made up of $n_1$ $a_1$'s, $\cdots$, $n_k$ $a_k$'s consecutively in the following examples. Each model is simulated 50 times, and we measure the performance of each algorithm mainly by test error and sparsity error. The test error is characterized by the 40% trimmed-mean of the scaled MSE (SMSE) on the test data, where SMSE is $100 \cdot (\sum_{i=1}^N (\hat y_i - y_i)^2/(N\sigma^2) - 1)$ defined for the test data. (Medians of MSEs are mostly used [30,39] to measure the performance from multiple runs, but in our experience they are not so stable for comparisons.) The sparsity error here is defined by the 40% trimmed-mean of the following 50 percentages: $100 \cdot |\{i : \mathrm{sgn}(\hat\beta_i) \neq \mathrm{sgn}(\beta_i)\}|/d$, which represents the percentage of inconsistent signs for each estimate compared with the true $\beta$. We also summarize the proper zero percentages, $100 \cdot |\{i : \beta_i = 0, \hat\beta_i = 0\}|/|\{i : \beta_i = 0\}|$, and the proper nonzero percentages, $100 \cdot |\{i : \beta_i \neq 0, \hat\beta_i \neq 0\}|/|\{i : \beta_i \neq 0\}|$, in the table below. The numbers in parentheses are the standard errors of the trimmed means of SMSE, estimated by bootstrapping the SMSE 500 times as in [38]. The total computing time (in seconds) for each algorithm is also included.
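The performance measures just described can be coded directly; a minimal sketch (the function names are ours):

```python
import numpy as np

def smse(y_hat, y, sigma2):
    # Scaled MSE on test data: 100 * (MSE / sigma^2 - 1)
    n = len(y)
    return 100.0 * (np.sum((y_hat - y) ** 2) / (n * sigma2) - 1.0)

def sparsity_error(beta_hat, beta):
    # Percentage of sign-inconsistent coefficients
    return 100.0 * np.mean(np.sign(beta_hat) != np.sign(beta))

def proper_zero_pct(beta_hat, beta):
    # Share of true zeros correctly estimated as zero
    z = (beta == 0)
    return 100.0 * np.mean(beta_hat[z] == 0)

def proper_nonzero_pct(beta_hat, beta):
    # Share of true nonzeros correctly estimated as nonzero
    nz = (beta != 0)
    return 100.0 * np.mean(beta_hat[nz] != 0)
```

For example, with the true $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)$ and an estimate $(2.9, 0, 0, 0, 1.8, 0.1, 0, 0)$, the sparsity error is 25% (two of eight signs disagree).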

Table 1: Performance comparisons on the simulation data, in terms of test error, sparsity error, proper sparsity, and proper nonsparsity; all numbers are 40% trimmed-means over the 50 simulations. Six methods are listed: lasso (Soft-TISP), one-step SCAD, Hard-TISP, SCAD-TISP, elastic net (eNet), and Hybrid-TISP; the last two both have two regularization parameters. The last row gives the total computing time in seconds.
First, although Zou & Li's one-step SCAD brings more sparsity than the lasso estimate (seen from the proper-sparsity and proper-nonsparsity), it is often the worst in terms of test error. This is because the one-step SCAD is in fact a weighted lasso method, and the OLS estimate used for weight construction may not be trustworthy if, say, there is large noise or high correlation between some variables. This phenomenon is serious in Example 2, where the OLS estimate can be unstable and misleading. Our Hard-TISP and SCAD-TISP clearly show the remarkable parsimoniousness brought by nonconvex penalties. Instead of solving an $l_1$-constrained convex approximation as in the LLA method, our TISPs directly tackle the original nonconvex penalized regressions and demonstrate better performance in both test-error and sparsity-error. (In fact, we doubt whether the $l_1$-based one-step SCAD is truly able to solve the SCAD-penalized regression, seen from the convex approximation in its derivation and after comparing its estimate with the SCAD-TISP.) Hard-TISP and SCAD-TISP do not differ much here, which verifies the theoretical results regarding the hard-thresholding family in Section 4.
Hard-TISP and SCAD-TISP achieve smaller test error than the lasso, which may introduce extra bias when the signal-to-noise ratio is medium or high. Interestingly, when the noise level is very high, the lasso (Soft-TISP) yields a more accurate estimate than the two. This is in fact not so surprising. In predictive learning, to reduce the test error when the noise is relatively large compared to the signal, it is necessary to shrink the nonzero coefficients even if the true ones are far from zero. In either hard- or SCAD-thresholding, there is basically no shrinkage for large nonzero coefficients, while the lasso shrinks them by soft-thresholding (although the shrinkage amount is tied to the thresholding value). Fortunately, TISP still gives us good selection results and achieves parsimonious models. We can apply, for example, a second-stage shrinkage to the coefficients of the selected variables. Of course, a better strategy is to take these two concerns, selection and shrinkage, into account simultaneously and adaptively in building a model, as explored in the next subsection.

Hybrid-TISP for model selection and shrinkage
To deal with the low-SNR problem, a promising approach is to modify the thresholding in Hard-TISP to include adaptive shrinkage for nonzero coefficients. Motivated by the thresholding function of ridge regression given by (3.6), we propose a novel hybrid-thresholding:
$\Theta(t; \lambda, \eta) = 0$ if $|t| < \lambda$, and $t/(1+\eta)$ otherwise.
The penalty constructed via the mechanism introduced in Section 3.1 is made up of two quadratic parts:
$P(t; \lambda, \eta) = \lambda|t| - t^2/2$ if $|t| < \lambda/(1+\eta)$, and $\eta t^2/2 + \lambda^2/(2(1+\eta))$ otherwise.
We have seen the first quadratic part in the continuous hard-penalty (3.3) (which leads to the same solution as the discrete $l_0$-penalty); the second part resembles a ridge penalty. See Figure 1 below. Note that the knots $\pm\lambda/(1+\eta)$ depend on $\eta$, too. Simple calculations show that this $P$ satisfies the BCC (cf. (3.7)) with $H = I$, and Theorem 3.1 holds. We can apply (3.10) given an arbitrary design matrix. The corresponding TISP (referred to as Hybrid-TISP) converges. The $\Theta$-equation (3.11) implies that the nonzero components of a Hybrid-TISP estimate result from a partial ridge regression. This fact can be used in implementation when the maximum number of iterations allowed has been reached.
Figure 1: The penalty defined by hybrid-thresholding. As $\lambda$ and $\eta$ vary, it takes the continuous hard-penalty and the ridge penalty as extremes.
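In code, hybrid-thresholding is a hard threshold followed by a ridge shrink of the survivors. A sketch consistent with the description above (the function name is ours); $\eta = 0$ recovers hard-thresholding and $\lambda = 0$ recovers ridge-thresholding:

```python
import numpy as np

def hybrid(t, lam, eta):
    # Hybrid-thresholding: kill |t| < lam (selection, as in hard-
    # thresholding), then ridge-shrink the survivors by 1/(1 + eta).
    return np.where(np.abs(t) < lam, 0.0, t / (1.0 + eta))
```

Plugging `hybrid` into the TISP iteration (3.5)/(3.10) yields Hybrid-TISP; $\lambda$ controls selection and $\eta$ controls shrinkage separately.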

Hybrid-TISP successfully offers both selection and shrinkage in estimating β.
Before going into the numerical results, we summarize the design traits of Hybrid-TISP as follows. (a) Its penalty provides a trade-off between the $l_0$-penalty and the $l_2$-penalty (ridge penalty), taking the two as extremes, from which we secure selection and shrinkage simultaneously. In particular, the selection is achieved by a penalty more like $l_0$ than $l_1$, as seen from the penalty function or the iterative thresholding. (b) Hybrid-TISP avoids double shrinkage. Double shrinkage is a serious problem in the design of the naive elastic net [39], which simply adopts a linear combination of the $l_1$-penalty and the $l_2$-penalty, so that the $l_1$-penalty also shrinks the nonzero coefficients in addition to the $l_2$-penalty. By contrast, Hybrid-TISP deals with the zeros and the nonzeros separately, by hard-thresholding and ridge-thresholding, respectively; there is no overlap between them. (c) We have two parameters, $\lambda$ and $\eta$, responsible for selection and shrinkage, respectively. One drawback of the lasso is that it uses the same parameter to control both selection and shrinkage [24]. Therefore, it may result in insufficient zeros even if the SNR is pretty high, as shown clearly in Table 1. Hybrid-TISP has $\lambda$ and $\eta$ designed for the two different purposes and can adapt to different sparsity and noise levels. (d) Selection and shrinkage interplay during the TISP iterations until, in the end, a selection/shrinkage balance is achieved in the final estimate. This is in contrast to the relaxed lasso [24], which treats selection and shrinkage as separate steps in building a model. (e) Finally, Hybrid-TISP is a very simple procedure to implement. It only involves multiplication and thresholding operations.
In the implementation of Hybrid-TISP, an empirical parameter search is usually needed to determine the values of $\lambda$ and $\eta$, because running a grid search over the $(\lambda, \eta)$-space is a formidable task. To save computational cost, we search along a few one-dimensional solution paths, including $\lambda$-paths (with $\eta$ fixed) and $\eta$-paths (with $\lambda$ fixed). The optimal tuning parameter from the ridge regression path (corresponding to $\lambda = 0$), denoted by $\eta^{(r)}$, is used as a reference for $\eta$. Briefly, our search process generates and searches along some $\lambda$- and $\eta$-paths, compares the results from these searches, and then takes $(\lambda, \eta)$ to be the pair minimizing the validation error. The concrete search paths are as follows. (i) $n > p$. Denote the OLS scale estimate by $\hat\sigma$. If $n/p < 5$, or $n/p < 10$ but $\hat\sigma > 5$, we adopt the alternating search strategy, which has been shown to be fast and efficacious [27]: fixing $\eta$ at $0.5\eta^{(r)}$, search along the $\lambda$-path to get an optimal solution (having the smallest validation error) at, say, $\lambda^{(o)}$; then search along the $\eta$-path with $\lambda$ fixed at $\lambda^{(o)}$. If $n/p > 10$ and $\hat\sigma < 5$, we only search over the $\lambda$-path with $\eta = 0.05\eta^{(r)}$. In all remaining cases, we generate and search along two $\lambda$-paths with $\eta = 0.5\eta^{(r)}$ and $0.05\eta^{(r)}$, respectively. (ii) $p > n$. We use the above alternating search starting with $0.5\eta^{(r)}$, and an additional search for $\lambda$ with $\eta$ fixed at $0.05\eta^{(r)}$. Accordingly, 3 paths in total are generated in the large-$p$ situation. This simple empirical search does not cover the full parameter space but is far more efficient than a grid search. The results are reported in Table 1. We also included the elastic net (eNet) in the experiments, which has two regularization parameters as well. Note that eNet generates and searches along 6 solution paths to tune its parameters [39].
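The alternating ($\lambda$-path then $\eta$-path) search can be sketched generically. In the sketch below, the fit inside `val_error` is only a stand-in (a ridge fit with a cut at $\lambda$), not the actual Hybrid-TISP, and all names and data are ours:

```python
import numpy as np

def val_error(X_tr, y_tr, X_val, y_val, lam, eta):
    # Placeholder fit: Hybrid-TISP would go here; a ridge fit followed
    # by a cut at lam keeps the search sketch runnable on its own.
    p = X_tr.shape[1]
    beta = np.linalg.solve(X_tr.T @ X_tr + eta * np.eye(p), X_tr.T @ y_tr)
    beta[np.abs(beta) < lam] = 0.0
    return np.mean((X_val @ beta - y_val) ** 2)

def alternating_search(X_tr, y_tr, X_val, y_val, lams, etas, eta0):
    # (i) fix eta = eta0 and search the lambda-path; (ii) fix the best
    # lambda and search the eta-path; return the minimizing pair.
    errs = [val_error(X_tr, y_tr, X_val, y_val, l, eta0) for l in lams]
    lam_o = lams[int(np.argmin(errs))]
    errs = [val_error(X_tr, y_tr, X_val, y_val, lam_o, e) for e in etas]
    return lam_o, etas[int(np.argmin(errs))]

rng = np.random.default_rng(3)
X_tr, X_val = rng.standard_normal((60, 8)), rng.standard_normal((60, 8))
beta_true = np.array([3.0, -2.0] + [0.0] * 6)
y_tr = X_tr @ beta_true + rng.standard_normal(60)
y_val = X_val @ beta_true + rng.standard_normal(60)

lams = np.linspace(0.0, 1.0, 11)
etas = np.linspace(0.0, 5.0, 11)
lam_o, eta_o = alternating_search(X_tr, y_tr, X_val, y_val,
                                  lams, etas, eta0=1.0)
```

The two one-dimensional sweeps visit only $|\texttt{lams}| + |\texttt{etas}|$ candidates instead of the $|\texttt{lams}| \cdot |\texttt{etas}|$ of a full grid.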
As seen from Table 1, Hybrid-TISP has excellent performance in both accuracy and sparsity. We briefly summarize the story as follows. When the noise level is low or medium, the value of $\lambda$ in the lasso is limited by the amount of shrinkage it induces, and thus the lasso gives insufficient sparsity. Large noise alleviates this problem, but there is still much room for improvement in test-error and sparsity-error, because the amount of shrinkage need not equal the thresholding value in the selection. A weighted lasso like the one-step SCAD has somewhat limited power because the OLS estimate may be inaccurate and misleading for weight construction. Benefiting from the $l_2$-penalty, the eNet shows much better accuracy in the case of large noise and/or high correlation between the variables; nevertheless, the sparsity of the estimate may be seriously hurt when the ridge penalty must take control, and it seems possible to improve its test-error further by incorporating this sparsity into estimation. All of these problems can be resolved by Hybrid-TISP, which achieves the right balance between shrinkage and selection. Its test error is consistently lower than that of the eNet, and, more importantly, Hybrid-TISP provides a parsimonious model, as Hard-TISP does.

Large sample and large dimension experiments
At the end of this section, we demonstrate the performance of TISP on large-n data as well as on large-p data. We modified the parameters in Example 1 and reran the simulations, where Σ_ij = ρ^|i−j| with ρ = .5, β is appended with zeros and given by [3, 1.5, 0, 0, 2, 0, 0, · · · , 0]^T, σ = 2, 5, and n, d are no longer fixed: in the large-sample experiment, d = 8 and n = 40, 80, 200 (5 times, 10 times, and 25 times as large as d); in the large-dimension experiment, n = 20 and d = 100, 200, 500 (5 times, 10 times, and 25 times as large as n). Table 2 shows the simulation results for these different combinations of n and d. In both situations, the Hybrid-TISP path is preferable in terms of accuracy and sparsity. Our conclusions are similar to the findings summarized before. Note that one-step SCAD uses the OLS estimate as the initial guess and thus is not included in the large-p simulation. In fact, as an instance of the adaptive lasso, it is most powerful in large samples with small noise and low correlation between covariates, where the OLS estimate is accurate. The elastic net is an improvement of the lasso and provides a good algorithm in predictive learning. However, despite having two regularization parameters, it does not much improve the sparsity of the lasso. Hard-TISP and SCAD-TISP give significantly different solution paths than the above convex penalties. They may dramatically reduce the sparsity error and the test error, say, for large-p sparse signals with moderate noise. Both thresholdings fall into the hard-thresholding family, which does not introduce much estimation bias for large coefficients. Interestingly, if our main concern is to reduce the test error in building a statistical model (which is the most frequently used tuning criterion in implementation), they are not always our best choices. Indeed, it is more desirable to apply adaptive shrinkage to the nonzero coefficient estimates to benefit from the bias-variance tradeoff.
Hybrid-TISP is successful especially for the large-p data because it does joint and adaptive selection and shrinkage.
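To make the simulation setup concrete, the data-generating mechanism of these experiments can be sketched as follows. This is a hedged sketch under the stated settings (rows of X drawn from N(0, Σ) with Σ_ij = ρ^|i−j|, true coefficients [3, 1.5, 0, 0, 2, 0, · · · , 0], Gaussian noise of level σ); the function name `simulate_data` is ours, not from the original implementation.

```python
import numpy as np

def simulate_data(n, d, rho=0.5, sigma=2.0, seed=0):
    """Generate (X, y) for the modified Example 1: rows of X are N(0, Sigma)
    with Sigma_ij = rho^|i-j| (AR(1) correlation), the true beta is
    [3, 1.5, 0, 0, 2, 0, ..., 0], and y = X beta + sigma * noise."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1) correlation matrix
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, d)) @ L.T               # rows ~ N(0, Sigma)
    beta = np.zeros(d)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]                   # sparse true coefficients
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```

Calling, e.g., `simulate_data(40, 8)` or `simulate_data(20, 500)` reproduces the large-n and large-p configurations above.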

Real Data
Hybrid-TISP was applied to a real prostate dataset, which was used by Tibshirani [30]. The prostate data have 97 observations and 9 clinical measures. In this example, unlike [30], we take log(cancer volume) (lcavol) as the response variable and consider a full quadratic model; the 43 predictors are 8 main effects, 7 squares, and 28 interactions of the eight original variables (lweight, age, lbph, svi, lcp, gleason, pgg45, and lpsa), where svi is binary (so its square is redundant). The lasso does not give stable and accurate results for this example due to the existence of many highly correlated predictors.
The regularization parameters of Hybrid-TISP were tuned by leave-one-out cross-validation. To identify the relevant variables in a trustworthy way, nonparametric bootstrap resampling was used with B = 100. For every bootstrap dataset, after standardizing the predictors, we applied Hybrid-TISP with the regularization parameters fixed at the values tuned for the original dataset. Figure 3 shows the percentage of the bootstrap coefficient estimates that are nonzero over the 100 replications for each of the 43 predictors. The histograms are plotted in Figure 2. It is easy to see that 8 variables are much more significant than the others. In fact, these are exactly the variables selected by Hybrid-TISP on the original data. A more careful examination shows that they appear (jointly) 36 times in the selected models, the most visited octuple in bootstrapping. These variables fall into two groups with similar patterns: (I) {x_5, x_19, x_25, x_38}, i.e., {lcp, lweight*lcp, age*lcp, gleason*lcp}; (II) {x_8, x_22, x_28, x_42}, i.e., {lpsa, lweight*lpsa, age*lpsa, gleason*lpsa}. The within-group correlations are very high: > .98 for Group (I) and > .93 for Group (II). Furthermore, an interesting feature is that for any of the eight variables selected by Hybrid-TISP, the other three in the same group are the ones most correlated with it among the 42 remaining predictors.
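The bootstrap selection-frequency computation can be sketched as follows. This is a simplified illustration: `fit_selector` is a placeholder standing for Hybrid-TISP with the tuned regularization parameters held fixed, and the helper names are ours.

```python
import numpy as np

def bootstrap_selection_frequencies(X, y, fit_selector, B=100, seed=0):
    """Refit a sparse selector on B nonparametric bootstrap resamples
    (regularization parameters held fixed) and return, for each predictor,
    the fraction of resamples in which its coefficient estimate is nonzero.
    Predictors are standardized within each resample."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    nonzero_counts = np.zeros(d)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        Xb = (Xb - Xb.mean(0)) / Xb.std(0)     # standardize the predictors
        beta = fit_selector(Xb, yb - yb.mean())
        nonzero_counts += (np.abs(beta) > 1e-10)
    return nonzero_counts / B
```

Plotting the returned frequencies against the predictor index yields a display of the kind described above, where consistently selected variables stand out.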

Discussion
We have proposed the thresholding-based iterative selection procedures for solving nonconvex penalized regressions. In fact, people have long noticed the weakness of the convex $l_1$-constraint (or the soft-thresholding) in wavelets and have designed many different forms of nonconvex penalties to increase model sparsity and accuracy. But for a nonorthogonal regression matrix, there is great difficulty in both investigating the performance in theory and solving the problem in computation. TISP provides a simple and efficient way to tackle this.

Table 2: Performance comparisons on the simulation data with large sample size or dimensionality, in terms of test error, sparsity error, proper sparsity, and proper nonsparsity over 50 simulations. Six methods are listed here: lasso (Soft-TISP), one-step SCAD, Hard-TISP, SCAD-TISP, elastic net (eNet), and Hybrid-TISP; the last two both have two regularization parameters.
Somewhat differently from other studies, we started from thresholding rules rather than penalty functions. Indeed, there is a universal connection between them, but a drawback of the latter is its non-unique form: different penalties may result in the same estimator and the same thresholding. The main contribution of this paper is the study of a class of Θ-estimators satisfying (3.11), which can be naturally computed by TISP and are associated with penalized regressions. With a carefully designed thresholding rule, we obtained a good estimator for model selection and shrinkage. Starting from Θ greatly facilitated the computation and the analysis. In fact, some penalty designs may even have a better explanation from Θ, or equivalently, the ψ-function: for example, the SCAD-penalty (recall that it is defined by its derivative) seems to originate from Hampel's three-part redescending ψ. Conversely, we can use TISP to compute M-estimators in robust statistics, as described in Section 3.3.
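For reference, a Θ-estimator can be computed by iterating β ← Θ(X^T y + (I − Σ)β; λ), the update underlying TISP (cf. (A.1) in the Appendix). A minimal Python sketch, assuming X has been scaled so that its spectral norm is at most 1, with soft-thresholding shown as a default choice of Θ; the helper names are ours:

```python
import numpy as np

def soft_threshold(a, lam):
    """Soft-thresholding rule: Theta(a; lam) = sgn(a) * (|a| - lam)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def tisp(X, y, lam, theta=soft_threshold, n_iter=500):
    """Generic TISP iteration: beta <- Theta(X^T y + (I - X^T X) beta; lam),
    started from the zero estimate. Assumes X is scaled so that its
    spectral norm is at most 1."""
    beta = np.zeros(X.shape[1])          # zero start
    Xty, G = X.T @ y, X.T @ X
    for _ in range(n_iter):
        beta = theta(Xty + beta - G @ beta, lam)
    return beta
```

With the soft-thresholding rule, the iteration converges to a lasso solution; plugging in a nonconvex rule (hard, SCAD, or the hybrid rule below) changes only the `theta` argument.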
Using a thresholding rule in the hard-thresholding family, TISP gives good selection results. Our novel Hybrid-TISP, accomplishing a fusion between the $l_0$-penalty and the $l_2$-penalty based on the hard-thresholding and the ridge-thresholding, shows superior performance and beats the commonly used methods in both test-error and sparsity. The hybrid penalty function (5.2) may look a bit odd, but it is quite natural and simple from the point of view of thresholding; see (5.1). It is worth mentioning that, in contrast to [2,3,19], where more than one tuning parameter is considered a drawback and an unnecessity, we believe a good procedure should have two explicit regularization parameters to control and balance selection and shrinkage.
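The hard-ridge rule behind Hybrid-TISP can be written in a few lines. The sketch below follows the verbal description above (hard selection followed by ridge-type shrinkage); see (5.1) for the exact definition.

```python
import numpy as np

def hybrid_threshold(a, lam, eta):
    """Hard-ridge thresholding: kill entries with |a| <= lam (selection),
    then shrink the survivors by a factor 1/(1 + eta) (ridge shrinkage).
    A sketch of the rule behind Hybrid-TISP; cf. (5.1) in the text."""
    return np.where(np.abs(a) > lam, a / (1.0 + eta), 0.0)
```

Here λ controls selection and η controls shrinkage, which is exactly why the two parameters can be tuned to balance the two tasks.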
We assume the penalty function P depends on β and λ only. Therefore the iterative weighting obtained by substituting the nonnegative garrote [19] for Θ in TISP is not covered by the studies in this paper. In fact, with β involved in P, it might be difficult to carry out the optimization in the second step of the mechanism introduced in Section 2.
The solution path associated with a nonconvex penalty is generally not continuous in λ. For example, even for the transformed $l_1$-penalty in Example 3, which is differentiable to any order on (0, +∞), the solution path still has no λ-continuity in practice. Hence a pathwise algorithm is not appropriate here. Empirically, using a zero estimate as the starting point in nonconvex TISPs works pretty well. We conjecture that this leads to an estimate with some least-norm property. Take Hard-TISP as an example: this roughly means that we are looking for the local minimum of the $l_0$-penalized regression that is closest to zero in building a parsimonious model.
The generalization of TISP to GLMs seems straightforward; we will investigate this topic in a subsequent paper. TISP fits perfectly into the Accelerated Annealing framework [27] and thus can be used in generic sparse regression with customizable sparsity patterns, such as the supervised clustering problem. Other future studies include developing acceleration techniques for TISP (like the relaxation and asynchronous updating [27]) and deriving risk oracles in theory.
This inequality is due to the BCC (3.7). On the other hand, we know [...]; in summary, we get [...]. Then, given β, we can write g as [...] and apply (A.1) with α = (I − Σ)β + X^T y. Correspondingly, for the TISP iterates β^(j), we have [...]; that is, (A.3). Now (3.8) and (3.9) can be obtained after simple calculations.

A.2. Proofs of Theorem 4.1 and Theorem 4.2
These theorems have both been essentially proved in [27]; we provide a self-contained proof as follows. All inequalities and the absolute value '| |' are understood in the componentwise sense. Assume, for the moment, that X has been column-normalized such that the diagonal entries of Σ = X^T X are all 1. Let Σ_I = X_I^T X_I and Σ_{I,I′} = X_I^T X_{I′} for any index sets I, I′. Lemma A.1. Assume Σ_nz is nonsingular. The TISP estimate β̂ satisfies the following equations [...]. This can be obtained directly from (4.2). To bound P(V^c), suppose the spectral decomposition of Σ_nz is given by [...]. It follows that [...], where ǫ″ ∼ N(0, I_{d_nz × d_nz}). Hence, [...]. In fact, we can get something slightly stronger than (A.7): observing that X_z^T ǫ is independent of X_nz^T ǫ, we have [...]. We assumed x_i^T x_i = 1, i = 1, · · · , d, in the above derivation. If the $l_2$-norm of each column of X is no greater than σ_max, we only need to replace β, β̂, and τ by β · σ_max, β̂ · σ_max, and τ/σ_max, respectively. The proof of Theorem 4.1 is now complete if σ_max = √n.

A.4. Proof of Theorem 4.4
Let β̂_H and β̂_S denote the hard- and soft-thresholding estimates with threshold value τσ. It is easy to see that β̂_H, β̂_S, and β̂ all have the same sign and that β̂ is sandwiched by the other two. Therefore, E‖β̂ − β‖_2^2 ≤ Σ_i E max((β̂_{S,i} − β_i)^2, (β̂_{H,i} − β_i)^2). It is thus sufficient to study soft- and hard-thresholdings in the univariate case.
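As a quick numerical illustration of the sandwiching (not part of the proof), one can check componentwise that the SCAD thresholding rule, taken here as a representative member of the hard-thresholding family with the conventional a = 3.7, shares the sign of its argument and lies between the soft- and hard-thresholding rules in magnitude when all three use the same threshold value:

```python
import numpy as np

def soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard(t, lam):
    return np.where(np.abs(t) > lam, t, 0.0)

def scad(t, lam, a=3.7):
    """SCAD thresholding rule with threshold value lam (a > 2)."""
    at = np.abs(t)
    return np.where(at <= 2 * lam, soft(t, lam),
           np.where(at <= a * lam,
                    ((a - 1) * t - np.sign(t) * a * lam) / (a - 2), t))

# Componentwise check of |soft| <= |scad| <= |hard| on a fine grid
t = np.linspace(-10, 10, 2001)
lam = 1.5
s, m, h = soft(t, lam), scad(t, lam), hard(t, lam)
sandwiched = (np.all(np.abs(s) <= np.abs(m) + 1e-12)
              and np.all(np.abs(m) <= np.abs(h) + 1e-12))
```

The same componentwise comparison is what reduces the risk bound above to the univariate soft and hard cases.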
Finally, it may be worth mentioning that although applying Stein's lemma is one possible way (see, for example, Gao [19]), it does not handle the oracle bound well for an estimator very close to hard-thresholding, like Zou's oracle bound for the adaptive lasso [38], because the hard-thresholding function is not weakly differentiable. (Due to an error made in the derivative calculation, Zou's oracle bound for the adaptive lasso, defined by min ½‖y − Xβ‖_2^2 + τ Σ_i w_i|β_i| with w_i ∝ |β̂_{ols,i}|^{−η}, should be (2 log n + 5 + 4η) · min(β_i^2, σ^2) + σ^2/(2√(π log n)), with the first factor being (2 log n + 5 + 4η) instead of (2 log n + 5 + 4/η); the corrected factor diverges as η goes to infinity. See [27] for details.)

A.5. Proof of Theorem 5.1

In the proof, all inequalities and the absolute value '| |' are understood in the componentwise sense. [...] where k_0 = ‖X‖_2. The proof follows the lines of the proof of Theorem 4.1. Assume, for the moment, that X has been column-normalized such that the diagonal entries of Σ = X^T X are all 1. Clearly, β̂_z = 0, |β_nz| ≥ λ/(k_0^2 + η) is a sufficient condition for the zero consistency of β̂. From Lemma A.1, the Θ-equation is equivalent to S_z β̂_z = (X_z^T − Σ_{z,nz} Σ_nz^{−1} X_nz^T)ǫ + λ Σ_{z,nz} Σ_nz^{−1} sgn(β_nz; λ