Selection by Partitioning the Solution Paths

The performance of penalized likelihood approaches depends profoundly on the selection of the tuning parameter; however, there is no commonly agreed-upon criterion for choosing the tuning parameter. Moreover, penalized likelihood estimation based on a single value of the tuning parameter suffers from several drawbacks. This article introduces a novel approach for feature selection based on the entire solution paths rather than the choice of a single tuning parameter, which significantly improves the accuracy of the selection. Moreover, the approach allows for feature selection using ridge or other strictly convex penalties. The key idea is to classify variables as relevant or irrelevant at each tuning parameter and then to select all of the variables which have been classified as relevant at least once. We establish the theoretical properties of the method, which requires significantly weaker conditions than existing methods in the literature. We also illustrate the advantages of the proposed approach with simulation studies and a data example.


Introduction
The penalized likelihood approach has been very popular for feature selection problems. Under the currently used framework, one first needs to compute the solution paths and then choose a tuning parameter based on a certain criterion. The solution yielded with the chosen tuning parameter is considered to provide the final estimates of the parameters. The problem of choosing a proper tuning parameter is notoriously difficult. In this paper, we propose to select features by utilizing the entire solution paths rather than searching for a single value of the tuning parameter.
Consider the following linear regression model:

y = Xβ* + ε, (1)

where y = (y_1, . . ., y_n)^T is an n-dimensional response vector, β* = (β*_1, . . ., β*_p)^T is a p-dimensional vector of regression coefficients, X = (x_1, . . ., x_p) is an n × p design matrix, and ε = (ε_1, . . ., ε_n)^T is an n-dimensional vector of independent and identically distributed random errors. We assume that all of the variables in X are standardized, so that the coefficients in β* are on the same scale. For the linear regression problem in (1), the penalized likelihood approach is equivalent to penalized least squares regression, and the regression coefficients are estimated by minimizing the following objective function:

||y − Xβ||_2^2 / n + λ Σ_{j=1}^p J(|β_j|), (2)

where J(•) is a penalty function which controls the number of non-zero coefficients and λ > 0 is a tuning parameter. For the penalty function J(•), one could use a convex penalty such as the lasso (Tibshirani, 1996) or the adaptive lasso (Zou, 2006), or one of the non-convex penalties such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), the minimax concave penalty (MCP) (Zhang, 2010), or the truncated l1 penalty (Shen, Pan and Zhu, 2012), among others. Some general selection criteria for the tuning parameter λ include cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC) (Akaike, 1973), the Bayesian information criterion (BIC) (Schwarz, 1978), and the generalized information criterion (Atkinson, 1980). Chen and Chen (2008) pointed out that these criteria usually identify too many irrelevant features when the number of variables is large. This phenomenon has also been described by Broman and Speed (2002), Siegmund (2004), and Bogdan, Ghosh and Doerge (2004) in their studies of quantitative trait loci mapping.

Table 1
General selection criteria used in the statistics and machine learning literature in major journals.

Criterion  Method                 Reference
CV         Adaptive lasso         Zou (2006), JASA
CV         Fused lasso            Tibshirani et al. (2005), JRSSB
CV         Truncated l1 penalty   Shen, Pan and Zhu (2012), JASA
GCV        Lasso                  Tibshirani (1996), JRSSB
GCV        SCAD                   Fan and Li (2001), JASA
BIC        Lasso                  Wang, Li and Tsai (2007), JRSSB
BIC        Lasso                  Yuan and Lin (2007), Biometrika
EBIC       Tilting                Cho and Fryzlewicz (2012), JRSSB
EBIC       Group lasso            Huang, Horowitz and Wei (2010), AOS

Chen and Chen (2008) proposed the extended BIC (EBIC), which promotes model sparsity by adjusting BIC with an additional penalty term for the growing number of parameters in the model. Recently, Sun, Wang and Fang (2013) developed a new technique via variable selection stability, which directly focuses on the selection of informative variables. Fan and Tang (2013) also proposed to select the tuning parameter by optimizing the generalized information criterion with an appropriate penalty depending on the dimensions of the model. Although the above criteria have been well studied for over a decade, there is no consensus regarding which criterion to employ for choosing the tuning parameter. As examples, see Table 1 for a short list of publications in major statistics and machine learning journals and the different criteria they use. In fact, the feature selection procedure currently in use, which utilizes only one chosen value for the tuning parameter, suffers from the unavoidable drawback that it is often impossible to correctly identify all the features, no matter which criterion is used.
To overcome this drawback, we develop an innovative, intuitive approach which avoids choosing a tuning parameter. Instead, our proposed approach utilizes information from the entire solution paths to improve the selection accuracy. We therefore name the proposed procedure selection by partitioning the solution paths (SPSP). Besides the improvement in selection accuracy, the SPSP approach can also achieve selection consistency under weaker conditions than the currently used framework, because SPSP does not require selection consistency at any specific value of the tuning parameter. As a matter of fact, we do not even require the estimates of the zero coefficients to be shrunk exactly to zero. This brings another major advantage of the SPSP approach: we can achieve feature selection with an l2 penalty, i.e., ridge regression. It is well known that ridge regression is more stable and handles collinearity better than the lasso. With the SPSP approach, we can enjoy these nice properties of ridge regression without sacrificing the capability of feature selection. Moreover, the minimizer of the l2-penalized likelihood function often has an explicit form, so SPSP also greatly reduces the effort of algorithm development in the presence of complicated likelihood functions.
The SPSP procedure is related to the stability selection approach proposed in the seminal discussion paper by Meinshausen and Bühlmann (2010). Their approach is based on the probabilities that variables will be selected, and the probabilities are obtained from a generic sub-sampling scheme. Therefore, the stability selection approach does not require the selection of a tuning parameter either. However, unlike the proposed SPSP procedure, stability selection does not work with ridge regression. Moreover, the computational cost of the SPSP procedure is only a tiny fraction of that of stability selection, since no sub-sampling is involved. Finally, we find from simulation studies that stability selection tends to select too few variables and therefore produces a significantly higher false negative rate than SPSP. This work is also remotely related to Bayesian variable selection approaches, where the tuning parameters or candidate models are assigned prior distributions and the posterior distributions of the models are evaluated. See, for example, Hoeting et al. (1999); Raftery, Madigan and Hoeting (1997); Posada and Buckley (2004); Barbieri and Berger (2004).
The rest of this article is organized as follows. Section 2 provides a simulated example to motivate the problem. Section 3 introduces the SPSP approach. Section 4 discusses the selection consistency of the SPSP procedure. Section 5 presents the results from various simulation examples. Section 6 provides an application of SPSP in a cancer study to detect significant genes for glioblastomas, and Section 7 discusses the advantages and future potential of this work. The Appendix provides technical proofs and additional simulation studies.

Selection by Partitioning the Solution Paths
In this section, we will illustrate the idea and the advantage of the proposed SPSP procedure with an example from model (1) in Wang et al. (2011), where p = 40. The rest of the coefficients β*_j, j = 11, . . ., 40, are all zeros. The entries of the variables x_j, j = 1, . . ., p, are generated from a multivariate Gaussian distribution whose marginals are the standard normal distribution. The pairwise correlation between the first 10 variables x_j, j = 1, . . ., 10, is 0.9. The remaining 30 variables x_j, j = 11, . . ., 40, are mutually independent, and they are also independent of the first 10 variables. Furthermore, we generate the errors from the normal distribution with mean 0 and standard deviation 3. The sample size is set to n = 50.
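This data-generating mechanism can be sketched as follows. The magnitudes of the non-zero coefficients β*_1, . . ., β*_10 are not reproduced in this excerpt, so the value 3.0 below is a placeholder for illustration only, not the specification from Wang et al. (2011).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 40

# Covariance: first 10 variables pairwise correlated at 0.9,
# remaining 30 independent of everything; all marginals standard normal.
Sigma = np.eye(p)
Sigma[:10, :10] = 0.9
np.fill_diagonal(Sigma, 1.0)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

beta_star = np.zeros(p)
beta_star[:10] = 3.0          # placeholder magnitude; see lead-in

# Errors are N(0, 3^2); sample size n = 50
y = X @ beta_star + rng.normal(0.0, 3.0, size=n)
```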
We compute the solution paths and plot them in Figure 1. The dashed lines represent the solution paths for the non-zero coefficients and the solid lines represent those for the zero ones. The tuning parameters chosen by twofold cross-validation, generalized cross-validation, AIC, BIC, and the extended BIC are shown by the vertical lines. Here, "BEST" is the tuning parameter which minimizes the number of incorrect selections (false positives + false negatives). We observe that cross-validation, generalized cross-validation, AIC, and BIC tend to select too many spurious variables and that the extended BIC tends to drop most of the important variables. Even for the "BEST" selection, the model excludes many non-zero coefficients. The problem is more evident when we focus on the three lines in the right panel of Figure 1. Here the two dotted lines (1 and 3) are the solution paths for two non-zero coefficients, while the solid line (2) is the solution path for a zero coefficient. Apparently, selecting a small λ, as AIC, BIC, and generalized cross-validation do, misleads us into identifying all three coefficients as non-zero. On the other hand, a large λ, as selected by cross-validation, the extended BIC, and "BEST", incorrectly shrinks both coefficients of the important variables to zero. In fact, it is impossible to correctly identify all three features no matter which value of the tuning parameter we choose, although one can likely see the differences between the three features by visual inspection. For this example, we also provide the solution paths of the elastic-net in the left panel of Figure 3, which shows that these traditional tuning parameter criteria also suffer from the same problem for the elastic-net.
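When the truth is known, the "BEST" benchmark can be computed directly along a solution path. The helper below is our own sketch, not code from the paper:

```python
import numpy as np

def best_lambda_index(coefs, true_support):
    """Index of the tuning parameter minimizing FP + FN along the path.
    coefs: (p, K) array of estimates, column k for the k-th lambda.
    true_support: boolean mask of length p marking non-zero coefficients."""
    selected = coefs != 0                          # (p, K) indicator
    fp = (selected & ~true_support[:, None]).sum(axis=0)
    fn = (~selected & true_support[:, None]).sum(axis=0)
    return int(np.argmin(fp + fn))
```

For a tiny path where the second column selects exactly the true support, the function returns index 1.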
The current restriction of utilizing just one tuning parameter can seriously reduce the accuracy of the feature selection in general, since solution paths like those in Figure 1 happen quite frequently, and for any penalty we may employ.This is especially true when there are large correlations among the variables or the dimensions of the features are extremely high.
Such deficiencies as described above motivate us to develop an approach that collectively uses evidence from the entire solution paths, as opposed to letting the results from a single value of λ dictate everything. We achieve this by first dividing the estimates at each value of λ into two clusters. Then we combine the cutoff points of all such clusters to form a boundary curve that partitions the solution paths into two regions, as shown by the red curves in Figure 2. We call the region inside the two red curves the zero region and the region outside the two curves the non-zero region. Finally, we choose as the important features all variables which have been identified as relevant for at least one value of λ. We consider a feature to be unimportant if its solution path never leaves the zero region.
It can readily be observed in Figure 2 that the SPSP procedure correctly selects 9 of the 10 relevant variables and drops all irrelevant ones, outperforming the results for any single value of λ in terms of selection accuracy. As a result, our approach can correctly identify the relevance of features like the three in Figure 1, labeled 1, 2, and 3. We also applied the SPSP procedure to the solution paths of the elastic-net for the same example, as seen in the right panel of Figure 3. Another advantage of SPSP is that it does not require the coefficients of the unimportant variables to shrink to zero; therefore, it allows us to carry out feature selection with only a ridge regression.
The proposed SPSP procedure considers a feature important even if its solution path enters the non-zero region only once. The strategy may seem aggressive in identifying relevant variables. This is because we start the SPSP process rather conservatively, in the sense that we consider every variable "unimportant" for the smallest value of λ, when we initiate the partitioning process. The clustering at larger values of λ depends on the results for the previous λ. Thus, the SPSP procedure combines a conservative starting point with an aggressive selection strategy to optimize the selection accuracy.

Notations
Consider the penalized least squares problem in (2). We denote the index set of the non-zero coefficients by S = {j : β*_j ≠ 0}, with s = |S|, and the index set of the zero coefficients by S^c = {j : β*_j = 0}. The goal of feature selection is to correctly recover this sparsity pattern from the noisy observations in the model and estimate S. Once we specify the penalty function J(•) in (2), we can compute the solution path on a grid of tuning parameters λ_1 < λ_2 < · · · < λ_K. In practice, the grid is usually equi-distant on the log scale (Bühlmann and van de Geer, 2011; Shen, Pan and Zhu, 2012). For each λ_k, we obtain a vector β̂(k) = (β̂(k)_1, . . ., β̂(k)_p)^T of the penalized least squares estimators. A variable is more likely to be identified as relevant if its estimator is farther away from 0, regardless of the sign of the estimator. Therefore, we use the absolute values b(k)_j = |β̂(k)_j| and denote their ordered values by b(k)_(1) ≥ b(k)_(2) ≥ · · · ≥ b(k)_(p). In general, variables with a larger value of b(k)_j are more likely to be important. We are therefore interested in finding a proper cutoff point T_k = T(λ_k) such that the estimated relevant set is Ŝ_k = {j : b(k)_j > T_k} and the estimated irrelevant set is Ŝ^c_k = {j : b(k)_j ≤ T_k}, with ŝ_k = |Ŝ_k|. Then we define the adjacent distances between these ordered values as

D(k)_j = b(k)_(j) − b(k)_(j+1), j = 1, . . ., p − 1. (3)

Note that D(Ŝ_k, Ŝ^c_k) = b(k)_(ŝ_k) − b(k)_(ŝ_k+1) is the gap between Ŝ_k and Ŝ^c_k. Let D_max(Ŝ^c_k) = max{D(k)_(ŝ_k+j) : j < p − ŝ_k + 1} be the largest adjacent distance in Ŝ^c_k, and let the second-largest such distance be the largest adjacent distance under D_max(Ŝ^c_k).
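In code, the ordered absolute estimates and their adjacent distances at a given λ_k can be computed as follows (a minimal sketch of the quantities defined above):

```python
import numpy as np

def adjacent_distances(beta_hat):
    """Return the ordered absolute estimates b_(1) >= ... >= b_(p)
    and the adjacent distances D_j = b_(j) - b_(j+1), j = 1, ..., p-1."""
    b = np.sort(np.abs(beta_hat))[::-1]
    D = b[:-1] - b[1:]
    return b, D
```

For beta_hat = (0.1, −2.0, 1.5, 0.0), the ordered absolute values are (2.0, 1.5, 0.1, 0.0) and the adjacent distances are (0.5, 1.4, 0.1); the largest distance, 1.4, marks a natural cutoff between {2.0, 1.5} and {0.1, 0.0}.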

Algorithm of SPSP
To identify an important feature j ∈ S correctly without introducing any spurious variables, the estimate β̂(k)_j has to be larger than the estimates of the zero coefficients at some λ_k. But unlike the currently used methods based on a single tuning parameter, we do not require the estimates β̂(k)_j, j ∈ S, to be "concurrently" larger than all the estimates of the zero coefficients at a certain λ_k. Rather, we just need β̂(k)_j to be relatively large at some λ_k for variable j to be selected. That is, even if β̂(k)_j and β̂(k)_{j′}, j, j′ ∈ S, take large values (compared to the estimates of the zero coefficients) at different λ_k, we can still identify both j and j′ without the cost of adding any false positive signals. The idea is to distinguish the larger estimates from the smaller ones at each λ_k and consider the variables with the larger estimates as Ŝ_k.
The major issue therefore is to find a proper way to separate the larger and smaller estimates at each λ_k. It is inappropriate to use a constant threshold, because for larger λ_k all the estimates might be small. However, the orders of β̂(k)_j, j ∈ S, are still likely to be larger than those of the estimates of the zero coefficients. Following this intuition, we can set the threshold so that all the larger estimates are of the same order. As one can imagine, such a threshold occurs where an adjacent distance in (3) is relatively large. Therefore, instead of comparing all the β̂(k)_j pairwise to find T_k, we can equivalently try to find an adjacent distance that is large enough to separate Ŝ_k and Ŝ^c_k. In principle, D(Ŝ_k, Ŝ^c_k), the gap between Ŝ_k and Ŝ^c_k, should be sufficiently large to separate the irrelevant features from the important ones if
1. all β̂(k)_j above the gap D(Ŝ_k, Ŝ^c_k) are of the same order;
2. all β̂(k)_j above the gap D(Ŝ_k, Ŝ^c_k) are of a higher order than those below the gap.
It is not hard to verify that these are equivalent to the following statements in terms of adjacent distances:
1. the adjacent distances above the gap D(Ŝ_k, Ŝ^c_k) are not of a higher order than the gap;
2. the adjacent distances below the gap D(Ŝ_k, Ŝ^c_k) are of a lower order than the gap.
For that reason, we consider D(Ŝ_k, Ŝ^c_k) to be large enough if it meets the following two criteria, where R is a constant which we can estimate from the data.
Based on this principle, we develop the SPSP algorithm as follows:

Algorithm 1: Selection by Partitioning the Solution Paths (SPSP)
1. At λ_1, set the initial values T_1 = ∞, Ŝ_1 = ∅, and Ŝ^c_1 = {1, . . ., p}, and estimate the constant R from the data.
2. For k = 2, . . ., K, determine the cutoff T_k from the partition at λ_{k−1} (Steps 2.1–2.3): take the largest estimate among Ŝ^c_{k−1} as the current boundary, and update the gap to D_max(Ŝ^c_k) whenever criteria (4) and (5) are met, where j is the location of D_max(Ŝ^c_k); then partition the estimates at λ_k into Ŝ_k and Ŝ^c_k.
3. Repeat Step 2 until k = K.
4. Identify the union of all Ŝ_k as the index set of our selected relevant variables, i.e., Ŝ = ∪_{k=1}^K Ŝ_k.
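The partitioning logic can be sketched in code as follows. This is only a simplified stand-in, under the assumption that the largest adjacent gap in the current zero set becomes the new boundary when it exceeds the remaining gaps by the factor R; it does not reproduce the authors' exact Steps 2.1–2.3 or their estimate of R.

```python
import numpy as np

def spsp_sketch(paths, R=5.0):
    """Simplified SPSP: paths is (K, p), row k holding beta_hat(lambda_k)
    with lambda_1 < ... < lambda_K; returns indices selected at least once."""
    K, p = paths.shape
    irrelevant = np.ones(p, dtype=bool)        # S^c_1 = {1, ..., p}
    selected = set()
    for k in range(K):
        b = np.abs(paths[k])
        # Step 2.1 analogue: boundary = largest estimate in previous zero set
        T = b[irrelevant].max() if irrelevant.any() else 0.0
        z = np.sort(b[b <= T])[::-1]           # ordered zero-set estimates
        if z.size > 1:
            D = z[:-1] - z[1:]                 # adjacent distances in the zero set
            j = int(np.argmax(D))
            others = np.delete(D, j)
            # hypothetical stand-in for criteria (4)-(5)
            if others.size == 0 or D[j] > R * others.max():
                T = z[j + 1] + D[j] / 2.0      # move the boundary into the gap
        relevant = b > T
        irrelevant = ~relevant
        selected |= set(np.flatnonzero(relevant))
    return sorted(selected)
```

On a toy path where variable 0 stays well separated from four near-zero variables, the sketch selects exactly variable 0.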
Remark 1. Principles (4) and (5) are implemented in Step 2.3. They are equivalent to the claim that the estimators of the non-zero coefficients should be of the same order and should have a higher order than the estimators of the zero coefficients. Instead of comparing every pair of the estimators, we just use adjacent distances for simplicity of computation. Therefore, finding the proper T_k reduces to finding an adjacent distance which is large enough, i.e., which satisfies (4) and (5), to be the gap between the estimators for the zero coefficients and those for the non-zero ones. The constant R, which we obtain from the data in Step 1, is used to decide whether D_max(Ŝ^c_k) is large enough to replace the current D(Ŝ_k, Ŝ^c_k). As a matter of fact, our final selection results are not sensitive to the value of R, as evidenced by the simulation study in Appendix 9.2.
Remark 2. In Step 2 we find, for each λ_k, the cutoff point T_k, the location of the gap which distinguishes the relevant and irrelevant variables, based on the results from λ_{k−1}. This not only simplifies the computation, but also makes the boundary line T_k = T(λ_k) smoother, avoiding unstable selection results. Specifically, in Step 2.1, we first use the largest estimated coefficient among those identified as "zero coefficients" at λ_{k−1} as the current boundary. This takes care of the case where some coefficients in Ŝ_{k−1} become small and enter the zero region at λ_k. In Steps 2.2 and 2.3, we decide whether any adjacent distance within Ŝ^c_k is large enough to be considered as the new gap between the zero and non-zero coefficients. This handles the scenario where there are too few variables in Ŝ_k, so that a "large" gap still exists.
Remark 3. When the algorithm starts at λ_1, we use the initial values set in Step 1, with all variables considered "irrelevant". In other words, we are conservative about identifying the relevant variables at the start of the procedure. This is because we implement an aggressive selection strategy in Step 4 by using the union of all the Ŝ_k as our estimator Ŝ of the set of relevant variables, which effectively balances out the very conservative initial setup to achieve the best selection accuracy. Another reason for the conservative initial setup is that the estimates at small values of λ_k are usually unstable: there might be too many non-zero entries in β̂(k), since the shrinkage effects of the penalty are minimal, so even the estimate of a zero coefficient can be rather large.
Once we identify the index set of all the relevant variables Ŝ, we estimate the regression parameters β_Ŝ in a model that only includes the features that have been selected, y = X_Ŝ β_Ŝ + ε, where X_Ŝ = (x_j)_{j∈Ŝ} and β_Ŝ = (β_j)^T_{j∈Ŝ}. In most cases, the number of features in Ŝ is smaller than the sample size, so we just use the least squares estimator for β_Ŝ. If the number of selected variables is larger than the sample size, we can use a ridge regression with a small shrinkage factor.
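This refitting step can be sketched as follows (the helper name and the size of the shrinkage factor are ours):

```python
import numpy as np

def refit(X, y, S, ridge_factor=1e-3):
    """Refit on the selected columns S: ordinary least squares when
    |S| < n, otherwise ridge with a small shrinkage factor."""
    Xs = X[:, S]
    n, s = Xs.shape
    if s < n:
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    else:
        beta = np.linalg.solve(Xs.T @ Xs + ridge_factor * np.eye(s), Xs.T @ y)
    return beta
```

When the selected set contains the true support and the response is noiseless, the least squares branch recovers the true coefficients exactly.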
One unique advantage of this procedure is that it can be applied not only to penalties like the lasso, adaptive lasso, SCAD, and MCP; it can also be applied to penalties which do not produce sparse solutions, such as the l2 penalty. Therefore, the procedure can greatly reduce the computational complexity of feature selection problems, since strictly convex penalties like the ridge are easier to optimize and their estimators have an explicit form.
In addition, the SPSP algorithm can be easily extended to handle selection problems in a wide range of models, including graphical models, generalized linear models, and Cox's proportional hazards models. The penalized likelihood approach, which obtains a sparse estimate by optimizing an objective function consisting of a likelihood and a penalty function, is usually applied in these models. Consequently, we can apply the SPSP algorithm to penalized likelihood estimators in a similar fashion. A simulation example using SPSP on Gaussian graphical models is provided in Appendix 9.3.

Consistent Feature Selection
In this section, we discuss the theoretical properties of the SPSP procedure with lasso for feature selection in a modern high-dimensional regime where p >> n.
Here we limit our efforts to linear regression, although the SPSP procedure is generally applicable to other selection problems as well.The technical proofs of all the lemmas and theorems in this section are available in Appendix 9.1.
"Consistent variable selection" for a procedure refers to the following property of its estimator Ŝ: In most of the existing literature, it is only possible to achieve feature selection consistency if the tuning parameter is restricted to a specific interval.Moreover, the widths of such intervals are usually so small that they converge to 0-see, for example, Fan and Li (2001), Fan and Peng (2004), Zhao andYu (2006), andZou (2006).It is also commonly recognized that under the high-dimensional setting, p >> n, the Gram matrix Σ = 1 n X T X is degenerate, which raises many difficulties in controlling the values of the lasso estimator.Therefore, some conditions on the design matrix are always required to establish the consistency of the feature selection.The most typical condition is probably the following irrepresentable condition (Meinshausen and Bühlmann, 2006;Yuan and Lin, 2007;Zhao and Yu, 2006;Zou, 2006): where η is a positive constant.Zhao and Yu (2006) showed that this condition is sufficient and almost necessary for lasso to be selection consistent.However, the condition is restrictive and difficult to verify in practice.Here, we will first show that the SPSP procedure is selection consistent under either a much weaker compatibility condition (Bühlmann and van de Geer, 2011) or the restricted eigenvalue condition (Bickel, Ritov and Tsybakov, 2009) if we can bound the tuning parameter to an interval of constant width, rather than one that is converging to zero.We further show that under a weak identifiability condition, the SPSP procedure achieves selection consistency for almost all values of the tuning parameter, i.e. the entire solution path.The weak identifiability condition is still weaker than the irrepresentable condition.We first introduce the following compatibility condition: Assumption 1. 
Compatibility Condition (Bühlmann and van de Geer, 2011; van de Geer, 2007). For some constant φ > 0 and for any vector ζ satisfying ||ζ_{S^c}||_1 ≤ 3||ζ_S||_1, the following compatibility condition holds:

||ζ_S||_1^2 ≤ (ζ^T Σ ζ) s / φ^2,

where s = |S| is the dimension of β_S.
The compatibility condition is based on the fact that the bias of the lasso estimator, ζ = β̂_L − β*, satisfies ||ζ_{S^c}||_1 ≤ 3||ζ_S||_1 with a probability close to 1 (Bickel, Ritov and Tsybakov, 2009; Bühlmann and van de Geer, 2011). Hence we can restrict ourselves to such vectors in the condition. Several similar assumptions have also been proposed to establish the consistency of the lasso, such as the restricted eigenvalue condition (Bickel, Ritov and Tsybakov, 2009), the restricted isometry condition (Candes and Tao, 2005), and the coherence condition (Bunea, Tsybakov and Wegkamp, 2007). The relations among these conditions can be found in Bühlmann and van de Geer (2011). Under the compatibility condition, we can bound both the bias and the prediction error of the lasso:

Lemma 1. (Bühlmann and van de Geer, 2011) Suppose that the compatibility condition holds, and let λ_0 = 2σ √((t^2 + 2 log p)/n) for any t > 0. Then for λ ≥ 2λ_0, we have

||X(β̂ − β*)||_2^2 / n + λ ||β̂ − β*||_1 ≤ 4 λ^2 s / φ^2

with a probability of at least 1 − 2e^{−t^2/2}. This lemma implies a bound for the prediction error as well as the following bound for the l1-error of the lasso estimator: ||β̂ − β*||_1 ≤ 4λs/φ^2. The compatibility condition required to bound the above errors is substantially weaker than the irrepresentable condition, which is necessary for achieving consistent variable selection under the currently used framework of choosing a single λ. Indeed, Bühlmann and van de Geer (2011) showed that the irrepresentable condition implies the compatibility condition.
With the proposed SPSP procedure, we can accomplish consistent variable selection without the irrepresentable condition. This is because at each λ, we cluster the lasso estimators into two groups rather than labeling all the variables with non-zero coefficient estimates as important features. Consequently, we only need to bound the bias of the lasso estimators rather than shrink some coefficients to zero. In what follows, we will first introduce some necessary notation and then present Theorem 1. The theorem shows that when λ is not too large, the SPSP procedure identifies the true relevant set S with a probability close to 1 under only the compatibility condition.
Let Ŝ_λ denote the set of important features identified at λ. Theorem 1 implies that P(Ŝ_λ = S) ≥ 1 − 2e^{−t^2/2}. When the tuning parameter λ satisfies min_{j∈S} |β*_j| > (1 + R)4λs/φ^2, we recover S exactly with a probability close to 1. Further, we conclude an analogous result for larger values of λ. The following theorem then follows immediately from the fact that our SPSP estimator is Ŝ = ∪_λ Ŝ_λ. In particular, when min_{j∈S} |β*_j| > (1 + R)2δ_0, Theorem 2 implies that the proposed SPSP procedure is consistent for variable selection under only the compatibility condition. In Theorem 2, we require that the tuning parameter λ is not larger than φ^2 D_max / (4s(1 + R)), a lower bound for which can be obtained with prior information. When no such information is available, we would need the SPSP estimators at larger values of λ to not select any spurious variables. This is easy to verify in practice, since we can simply examine whether any new variables enter the relevant set at larger values of λ. However, in order to theoretically guarantee such behavior of the solution paths, we need an additional condition, which is still substantially weaker than the irrepresentable condition:

Assumption 2. Identifiability Condition. Let η > 0 be some constant. For any β̃ = (β̃_S, β̃_{S^c}), the identifiability condition (6) holds.

The identifiability condition indicates that with the true set of relevant variables, we can approximate the noiseless response Xβ* at least as well as with any other set of variables under almost the same l1 constraint. It is not difficult to verify that this condition is weaker than the irrepresentable condition.
Lemma 2. The irrepresentable condition implies the identifiability condition.
On the right-hand side of (6), the coefficients of the irrelevant variables are set to β_{S^c} = 0. In fact, we can further weaken the identifiability condition by allowing β_{S^c} to be non-zero on the right-hand side of (6); instead, we only require ||β_{S^c}||_1 to be smaller than ||β_S||_1 up to a constant k. On top of that, we also relax the inequality (6) by adding the term κη||β̃_{S^c}||_1 on the right side. As a result, we obtain the following weak identifiability condition:

Assumption 3. Weak Identifiability Condition. Let η > 0 be some constant. For any β̃ = (β̃_S, β̃_{S^c}), k = 2/(2s + Rs(s + 1)), and some κ satisfying the required bound, the following weak identifiability condition holds over the set Θ(||β̃_S||_1, ||β̃_{S^c}||_1) = {β = (β_S, β_{S^c}) : . . .}.

Henceforth, we refer to the preceding weak identifiability condition with constants k and κ as WIC(k, κ). Apparently, WIC(0, 0) is simply the identifiability condition in (6), and WIC(k, κ) always implies WIC(k′, κ′) for k′ > k and κ′ > κ. Therefore Assumption 3 is always weaker than Assumption 2. The above assumption ensures that when λ is large, the lasso estimates of the zero coefficients will not be much larger than those of the non-zero coefficients, so that the SPSP procedure will not include any false positive signals in Ŝ. We combine this consideration with Theorem 1 to obtain the following result for the entire solution path under the compatibility condition and WIC(k, κ): Theorem 3.
Suppose that the compatibility condition and WIC(k, κ) hold. Then the SPSP procedure over λ > 2λ_0 recovers S with a probability close to 1. When the true values of the coefficients are of a higher order than √(log p / n), it follows immediately from Theorem 3 that the asymptotic probability of identifying the true relevant set S is 1. Here we only need the tuning parameter λ to be not too small, λ > 2λ_0; unlike the existing literature, we do not require the tuning parameter λ to lie in a specific region. In fact, we do not need consistent variable selection at any value of λ. We only need to control the bias of the lasso estimators for smaller values of λ and to control the l1 norm of the lasso estimators of the zero coefficients for larger values of λ; both of these results require weaker conditions than achieving selection consistency at specific values of λ. By combining these results, the SPSP procedure accomplishes feature selection consistency under substantially weaker conditions, without a proper choice of the tuning parameter.
Theoretical results for other penalty functions can be developed following the paradigm described above. We can derive Theorem 2 given the bias of the corresponding penalized estimators and the corresponding condition. For the non-convex penalties, the bias is smaller than that of the lasso. For the l2 penalty, Shao and Deng (2012) give the bias and convergence rate of high-dimensional ridge regression. To obtain Theorem 3, we need identifiability conditions similar to those in (6) and (7). However, the constraint on the l1 norm needs to be replaced by a constraint on the l2 norm for ridge regression, and by the corresponding functional form for the non-convex penalties.
Simulation Studies

For all simulation examples, we apply the SPSP procedure to the solution paths for the lasso, adaptive lasso, SCAD, and MCP. One compelling advantage of the proposed SPSP procedure is that it can be applied to the l2 penalty, so we also implement the SPSP algorithm with ridge regression for these examples. We compute the solution paths for the convex penalties with the R package glmnet and those for SCAD and MCP with the R package ncvreg. All of the solution paths are computed over a grid of K = 100 values of the tuning parameter λ.
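An analogous path computation in Python, using scikit-learn's lasso_path instead of glmnet (the data here are synthetic, for illustration only):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
n, p = 50, 40
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([3.0, 2.0, 1.5]) + rng.normal(0.0, 1.0, n)

# 100 tuning-parameter values, equi-distant on the log scale by default
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
# coefs[:, k] is the lasso estimate at the k-th tuning parameter
```

The resulting (p, 100) coefficient matrix is exactly the input that a path-based procedure like SPSP operates on.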
In Tables 2-5, we compare the performance of the proposed SPSP procedure with 10-fold cross-validation, generalized cross-validation, AIC, BIC, the extended BIC, and stability selection over 500 replicates for each setup. We record the following measures: FP, the number of false positives; FN, the number of false negatives; and ME, the model error, defined as ME = (β̂ − β*)^T Ω (β̂ − β*)/σ^2, where Ω is the sample covariance of X. We report the mean values of false positives and false negatives along with the median of the model error, because the distribution of the model error values is heavily skewed. We also report the standard errors of the above measures in parentheses.
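These measures can be computed as follows (our own helper, with σ^2 = 1 by default):

```python
import numpy as np

def selection_metrics(beta_hat, beta_star, X, sigma2=1.0):
    """FP and FN counts plus Model Error (d' Omega d)/sigma^2 with
    d = beta_hat - beta_star and Omega the sample covariance of X."""
    S_hat = beta_hat != 0
    S = beta_star != 0
    fp = int(np.sum(S_hat & ~S))     # selected but truly zero
    fn = int(np.sum(~S_hat & S))     # true signals that were missed
    Omega = np.cov(X, rowvar=False)
    d = beta_hat - beta_star
    me = float(d @ Omega @ d) / sigma2
    return fp, fn, me
```

A perfect estimate gives FP = FN = 0 and ME = 0; each spurious selection adds to FP and each missed signal to FN, regardless of the estimated magnitudes.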
Table 2 summarizes the results for M1. We observe that cross-validation, generalized cross-validation, AIC, and BIC tend to select too many variables for all the penalties, as evidenced by the extremely large false positive numbers. The performance of the extended BIC here is comparable to that of SPSP: SPSP produces a smaller number of false negatives and a slightly larger number of false positives. Stability selection performs well for the adaptive lasso, but yields large false negative values for the other three penalties, especially the two non-convex ones: it misses two of the three signals on average for MCP and SCAD. The SPSP procedure provides the smallest model error for all penalties except the adaptive lasso, where the model error for SPSP is slightly larger than that for stability selection.
The simulation results for M2 are summarized in Table 3. We find that our SPSP approach has an even more significant advantage over the other approaches when p >> n than in M1. The SPSP procedure produces much smaller false positive values than cross-validation, generalized cross-validation, AIC, and BIC for the lasso and the adaptive lasso, while its false negative values are close to those of the other approaches. On the other hand, the extended BIC and stability selection often identify one or none of the important signals, although their false positive values are slightly smaller than that of SPSP. Here cross-validation performs similarly to SPSP for the non-convex penalties, but it performs rather poorly for the lasso and the adaptive lasso. Finally, the performance of SPSP with ridge regression is very close to that with the other penalties, which demonstrates that even the l2 penalty can be effective for feature selection under the SPSP procedure.
Table 4 presents the simulation results for M3, where the true model is very sparse and all of the important variables are correlated. We observe that, compared to cross-validation, generalized cross-validation, AIC, and BIC, the SPSP procedure dramatically improves the selection accuracy by selecting fewer irrelevant variables. We also observe that stability selection identifies fewer than one of the six signals on average with the non-convex penalties. Moreover, SPSP produces the smallest model error among all the approaches for all of the penalties, although in this case the extended BIC also shows competitive performance.
The results for M4 are shown in Table 5. We observe that when the models are misspecified, the SPSP procedure enjoys substantially better performance for all four penalties in both selection accuracy and model error compared to the other selection criteria, except that cross-validation performs similarly to SPSP for the two non-convex penalties.
For each model in the simulation studies, we further select one example to illustrate the advantage of the SPSP approach on the lasso solution paths. We plot these solution paths and the partitions produced by SPSP in Figure 4.
In summary, the proposed SPSP approach provides the best or close-to-best performance for all of the penalties in all simulation examples. Generally speaking, cross-validation, generalized cross-validation, AIC, and BIC tend to select too many spurious variables, while the extended BIC and stability selection tend to ignore important features and in some cases even fail to identify any useful signals. Moreover, both stability selection and cross-validation require significantly more computation time than SPSP: in our simulation studies, the computational cost of stability selection is 30-60 times that of SPSP. Finally, the performance of SPSP with ridge regression is quite similar to that with the other penalties, also generally outperforming the other selection criteria. This establishes the promise of selection using the l2 penalty, a unique feature of SPSP.

Data Analysis
In this section, we apply the proposed SPSP procedure, along with the other selection criteria, to analyze the glioblastoma (GBM) gene expression data from The Cancer Genome Atlas (TCGA) consortium (https://xenabrowser.net/datapages/). The goal of the analysis is to identify genes that are highly informative about the survival time of glioblastoma patients. The two gene expression data sets we obtained were both measured experimentally by the University of North Carolina TCGA genomic characterization center. In our analysis, we use the logarithm of the survival time as the response variable. The sample sizes of the two data sets are 428 and 91, after removing one sample common to both. We use the larger data set as our training set and the smaller one as the testing set. We first screen the 17,814 genes in the training data using sure independence screening (Fan and Lv, 2008) to identify the 1,000 genes most correlated with the response (Wang et al., 2011). We then apply the proposed SPSP approach and the other methods to the training set and evaluate the prediction errors of the responses on the test data.
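The screening step is simple to state: keep the d features with the largest absolute marginal correlation with the response. Below is a minimal Python sketch of sure independence screening on simulated data; the function name sis_screen and the toy problem sizes are our own, while the actual analysis keeps d = 1000 genes.

```python
# Hedged sketch of sure independence screening (Fan and Lv, 2008):
# rank features by absolute marginal correlation with the response.
import numpy as np

def sis_screen(X, y, d):
    """Indices of the d columns of X most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]

rng = np.random.default_rng(2)
X = rng.standard_normal((80, 200))
y = 3 * X[:, 5] + rng.standard_normal(80)   # only column 5 matters
keep = sis_screen(X, y, d=10)
print(5 in keep)   # True: the signal gene survives screening
```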
Table 6 shows the number of genes selected in the training set and the corresponding mean prediction error in the test set for each method. We notice that although SPSP for the lasso selects only 4 genes, it yields the smallest prediction error among all the penalty-criterion combinations. On the other hand, cross-validation, generalized cross-validation, AIC, and BIC for the lasso all select too many irrelevant genes and therefore produce larger prediction errors. Both the extended BIC and stability selection select fewer variables and produce larger prediction errors. In fact, the extended BIC fails to recognize any signals for any of the penalties, and stability selection fails to recognize any signals for the non-convex penalties.
Two genes identified by the proposed SPSP approach for the lasso have been shown to be important in previous studies. MAGEC2, the melanoma-associated antigen C2, a member of the MAGE gene family, is expressed in a wide variety of tumor types, including breast cancer (Song et al., 2016) and glioma (Sasaki et al., 2001). Song et al. (2016) further demonstrated that the gene expression levels of MAGEC2 are highly positively correlated with phosphorylated STAT3, the signal transducer and activator of transcription 3, in human tumor tissues, and STAT3 activation has been shown to affect multiple endogenous negative regulators such as PIAS3 (protein inhibitor of activated STAT3) in glioblastoma multiforme tumors. Therefore, the results of our present analysis provide additional evidence, beyond the work of Song et al. (2016), regarding the impact of MAGEC2 on glioblastoma behavior, which may motivate scientists to conduct further experiments exploring the direct relationship between MAGEC2 and human glioma. Moreover, the SPSP procedure additionally identifies TMEM49, also named VMP1 (vacuole membrane protein 1), a gene which encodes an important protein in the process of autophagy. Ying et al. (2011) identified the gene as the direct and functional target of  and investigated the associations between its expression and hepatocellular carcinoma (HCC). Lai et al. (2015) demonstrated that miR-210 is a potential serum biomarker for the diagnosis and prognosis of glioma, and it is also worth conducting further experiments to establish the connection between the expression of this gene and glioblastoma. Finally, our SPSP approach is the only method which identifies both MAGEC2 and TMEM49 without selecting more than 100 genes.

Discussion
We have proposed a novel selection procedure for the penalized likelihood approach based on the entire solution paths. By utilizing the estimators over all values of the tuning parameter, we obtain better selection accuracy than the commonly used approach of selecting a single tuning parameter based on a certain criterion. Moreover, the proposed SPSP procedure achieves selection consistency under conditions that are substantially weaker than the irrepresentable condition, which is almost necessary under the framework currently in use.
Another advantage of SPSP is that we can now carry out selection with the strictly convex l2 penalty. Although the present paper mainly focuses on feature selection for linear models, the SPSP procedure can easily be applied to most selection problems with one or more tuning parameters. In Section 3 of the appendix, we include a simulation study on estimating high-dimensional Gaussian graphical models with the SPSP approach to illustrate its potential for other selection problems.
With this study, we hope to initiate a discussion of how to better exploit the information contained in entire solution paths. For example, we could rank the importance of features by exploring the differences between the behaviors of the solution paths for important features and those for spurious ones. It may also be possible to develop an inference procedure and quantify the uncertainty of the selection results based on the entire solution paths. In addition, it would be interesting to see whether the solution paths for important features differ from those for irrelevant ones in any manner other than the magnitude of the estimators.

Code
All source code for the simulations and real data analysis can be found on GitHub: https://github.com/yliu433/r-spsp.

Technical proofs
Proof of Theorem 1. By Lemma 1, with probability at least 1 − 2e^{−t^2/2}, we have and similarly for any The above inequality follows immediately from C_i λ > R and To verify the partitioning rule, we have The last inequality follows from the fact that R = 1 + C.

Proof of Theorem 2. A sufficient condition for Ŝλ
Then the theorem follows immediately from Theorem 1 and the fact that Ŝ = ∪ λ Ŝλ .
Proof of Lemma 2. Note that By the irrepresentable condition, there is an η > 0 such that sign Therefore, the identifiability condition holds if the irrepresentable condition holds.
To prove Theorem 3, we introduce the following two lemmas.
Lemma 3. Under WIC(k, κ), the following inequality holds for the lasso solution. Proof. Since β̂ = (β̂_S, β̂_{S^c}) is the lasso solution, then for β̂_S = arg min If the following inequality holds when β̂_S Then it follows from the identifiability condition that Because we have concluded P(‖(1/n) 2ε^T X‖_∞ < λ_0) > 1 − 2e^{−t^2}, when ‖β̂_{S^c}‖_1 > k ‖β̂_S‖_1, we have The last inequality follows from λ > λ_0. Proof of Theorem 3. The theorem follows directly from Theorem 2 and Lemma 4.

Simulation studies on the sensitivity of SPSP to the value of R
The constant R in the proposed SPSP algorithm controls the order of magnitude in the partitioning rule. Here we examine the sensitivity of the SPSP procedure with respect to the choice of R. Note that in the numerical studies we conducted, these constants are data-adaptive rather than set arbitrarily.
In particular, we select a sequence of numbers from 1 to 10, in increments of 0.5, as candidates for the constant R. We use each candidate value as the constant R in the proposed SPSP algorithm on the lasso for all of the models, and record the means of the false positive rates (FPR = FP / number of true zero features) and the false negative rates (FNR = FN / number of true nonzero features) for each value of R.
The results are shown in Figure 5. We observe that the means of the FPR and FNR are relatively stable across the different choices of the constant R. Note that the vertical line in each graph represents the mean of the values of R estimated from the data. This illustrates that the selection results of the SPSP algorithm are not sensitive to the choice of the constant R as long as R stays within a reasonable range.
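For reference, the two rates plotted in Figure 5 can be computed as in the following Python sketch, which uses a hypothetical selection: FPR = FP / (number of true zero features) and FNR = FN / (number of true nonzero features), exactly as defined above.

```python
# Sketch of the sensitivity check's error rates (toy inputs below).
import numpy as np

def fpr_fnr(selected, truth):
    """selected, truth: boolean arrays marking nonzero coefficients."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    fpr = np.sum(selected & ~truth) / max(np.sum(~truth), 1)
    fnr = np.sum(~selected & truth) / max(np.sum(truth), 1)
    return fpr, fnr

truth = np.array([True, True, False, False, False])
selected = np.array([True, False, True, False, False])
fpr, fnr = fpr_fnr(selected, truth)
# one spurious pick among three true zeros (FPR = 1/3),
# one missed signal among two true nonzeros (FNR = 1/2)
```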

Simulation study on the Gaussian graphical model
Besides feature selection for linear models, the SPSP procedure introduced in this paper can be applied to many other selection problems under the framework of penalized likelihood estimation. Here we present a simulation study to illustrate the performance of the SPSP algorithm for the Gaussian graphical model.
We compare the performance of the proposed SPSP algorithm with the graphical models selected by BIC and the extended BIC (EBIC). The details of using BIC and the EBIC to choose the tuning parameter for graphical models are described in Foygel and Drton (2010). We apply the R package glasso to compute the graphical lasso estimators and the package qgraph to select the graphical lasso models by BIC and the EBIC. Note that the grid of tuning parameters in the simulation is generated automatically by the function glassopath. We report the means and standard errors of the number of false positives (FP) and the number of false negatives (FN) for the SPSP algorithm, BIC, and the EBIC over 100 replicates in Table 7. We observe that BIC tends to include too many zero dependencies (a high FP value), while the EBIC misses all of the nonzero dependencies (a high FN value). Compared with these two criteria, the SPSP algorithm performs much better in terms of selection accuracy: it selects most of the nonzero dependencies without adding many zero dependencies to the model.
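For readers without R, a rough Python analogue of the EBIC selection step is sketched below: it fits graphical lasso estimators over a small grid of tuning parameters with scikit-learn and keeps the one minimizing the EBIC of Foygel and Drton (2010). This is not the paper's glasso/qgraph code, and the grid and data here are hypothetical.

```python
# Hedged Python analogue of graphical-lasso selection by EBIC
# (the paper uses the R packages glasso and qgraph instead).
import numpy as np
from sklearn.covariance import graphical_lasso

def ebic_glasso(X, alphas, gamma=0.5):
    """Precision matrix minimizing EBIC over a grid of alphas."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    best = (np.inf, None)
    for a in alphas:
        _, theta = graphical_lasso(S, alpha=a)
        # Gaussian log-likelihood of the precision matrix (up to constants)
        ll = (n / 2) * (np.linalg.slogdet(theta)[1] - np.trace(S @ theta))
        # number of estimated edges: nonzero off-diagonals, counted once
        E = (np.count_nonzero(theta) - p) // 2
        ebic = -2 * ll + E * np.log(n) + 4 * E * gamma * np.log(p)
        if ebic < best[0]:
            best = (ebic, theta)
    return best[1]

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
theta_hat = ebic_glasso(X, alphas=[0.1, 0.3, 0.5])
print(theta_hat.shape)   # (5, 5)
```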

Fig 1. Left: The lasso solution paths for the simulated example. The dashed lines are the paths of the 10 non-zero coefficients, while the black lines are the paths of the 30 zero coefficients. The vertical lines represent the tuning parameters selected by different criteria. Right: The lasso solution paths for the non-zero coefficients 1 and 3 and the zero coefficient 2. Here CV is cross-validation, GCV is generalized cross-validation, and EBIC is the extended BIC.
Let ŝ_k = |Ŝ_k| be the number of variables in the estimated relevant set Ŝ_k; there are p − ŝ_k variables in the estimated irrelevant set Ŝ^c_k. Hereafter, we define the gap between Ŝ_k and Ŝ^c_k as the adjacent distance between b_(s−1) and C^upper_s = ∞. In addition, let R = 1 + C. We also sort the absolute values of the estimators from the lasso in ascending order to get b_(1), ..., b_(p), and define the adjacent distances of the lasso estimators as D_1 = b_(1), D_2 = b_(2) − b_(1), ..., D_p = b_(p) − b_(p−1).
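The adjacent distances defined here are straightforward to compute. The Python sketch below, using a toy coefficient vector of our own, returns D_1 = b_(1) and D_j = b_(j) − b_(j−1) for the sorted absolute estimates; the full SPSP partitioning rule, which compares these gaps against the threshold involving R, is not reproduced.

```python
# Sketch of the adjacent-distance bookkeeping used by SPSP.
import numpy as np

def adjacent_distances(beta_hat):
    """D1 = b(1), Dj = b(j) - b(j-1) for sorted absolute estimates."""
    b = np.sort(np.abs(np.asarray(beta_hat, dtype=float)))
    return np.diff(np.concatenate(([0.0], b)))

# Two large estimates and three near-zero ones: the big gap in D
# separates the estimated relevant set from the irrelevant set.
D = adjacent_distances([0.0, -0.02, 3.1, 0.01, 2.9])
```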

Fig 4 .
Fig 4. Left panels: The lasso solution paths for one simulated example of each of the four models. The dashed lines are the paths of the non-zero coefficients, while the black lines are the paths of the zero coefficients. The vertical lines represent the tuning parameters selected by different criteria. Right panels: Partitions of the lasso solution paths of the same simulated examples using SPSP.

Fig 5 .
Fig 5. The means of the FPR and FNR over 500 replicates for different choices of the constant R in SPSP on the lasso. The vertical lines in the graphs are the average values of R estimated from the data.

Table 2
Simulation results for Model 1 over 100 replicates.

Table 3
Simulation results for Model 2 over 100 replicates.

Table 4
Simulation results for Model 3 over 100 replicates.

Table 5
Simulation results for Model 4 over 100 replicates.

Table 7
The means of the FP and FN values of the SPSP algorithm, BIC, and the EBIC over 100 replicates (standard errors in parentheses). The true model has 25 nonzero dependencies and 4925 zero dependencies.