BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models

We introduce a new R package, BeSS, for solving the best subset selection problem in linear, logistic and Cox's proportional hazards (CoxPH) models. It utilizes a highly efficient active set algorithm based on primal and dual variables, and supports sequential and golden section search strategies for best subset selection. We provide a C++ implementation of the algorithm using an Rcpp interface. We demonstrate through numerical experiments on extensive simulated and real datasets that the new BeSS package has competitive performance compared to other R packages for best subset selection.


Introduction
One of the main tasks of statistical modeling is to characterize the association between a response variable and multiple predictors. The linear model (LM), as a simple parametric regression model, is often used to capture linear dependence between the response and predictors. Two other common models, the generalized linear model (GLM) and Cox's proportional hazards (CoxPH) model, can be considered extensions of the linear model for different types of responses. Parameter estimation in these models can be computationally intensive when the number of predictors is large. Meanwhile, Occam's razor is widely accepted as a heuristic rule for statistical modeling, balancing goodness of fit against model complexity. This rule leads to a relatively small subset of important predictors.
The canonical approach to the subset selection problem is to choose the best k out of p predictors for each k ∈ {0, 1, 2, . . . , p}. This involves an exhaustive search over all 2^p possible subsets of predictors, which is an NP-hard combinatorial optimization problem. To speed this up, Furnival and Wilson (1974) introduced a well-known branch-and-bound algorithm with an efficient updating strategy for LMs, which was later implemented in R packages such as leaps (Lumley and Miller 2017) and bestglm (McLeod and Xu 2010). Hofmann, Gatu, Kontoghiorghes, Colubi, and Zeileis (2020) proposed an exact variable-subset selection algorithm for linear regression and implemented it in the R package lmSubsets. For GLMs, however, a simple exhaustive screen is undertaken in bestglm. When exhaustive screening is not feasible for GLMs, fast approximate approaches based on genetic algorithms have been proposed. For instance, kofnGA (Wolters 2015) implements a genetic algorithm to search for a best subset of a pre-specified model size k, while glmulti (Calcagno and de Mazancourt 2010) implements a genetic algorithm to automatically select the best model for GLMs with no more than 32 covariates. These packages can only deal with dozens of predictors and not the high-dimensional data arising in modern statistics. Recently, Bertsimas, King, and Mazumder (2016) proposed a mixed integer optimization approach to find feasible best subset solutions for LMs with relatively large p, which relies on certain third-party integer optimization solvers. Alternatively, regularization strategies are widely used to transform the subset selection problem into a computationally feasible problem. For example, glmnet (Friedman, Hastie, and Tibshirani 2010; Simon, Friedman, Hastie, and Tibshirani 2011) implements a coordinate descent algorithm to solve the LASSO problem, a convex relaxation obtained by replacing the cardinality constraint in the best subset selection problem with the L1 norm.
In this paper, we consider a primal-dual active set (PDAS) approach to exactly solve the best subset selection problem for sparse LM, GLM and CoxPH models. The PDAS algorithm for linear least squares problems was first introduced by Ito and Kunisch (2013) and later discussed by Jiao, Jin, and Lu (2015), Huang, Jiao, Liu, and Lu (2017) and Ghilli and Kunisch (2017). It utilizes an active set updating strategy and fits the sub-models through the use of complementary primal and dual variables. We generalize the PDAS algorithm to general convex loss functions with the best subset constraint, and further extend it to support both sequential and golden section search strategies for determining the optimal k. We develop a new package BeSS (best subset selection, Wen, Zhang, Quan, and Wang 2020) in the R programming system (R Core Team 2020), with a C++ implementation of the PDAS algorithms and memory-optimized sparse matrix output. This package is publicly available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=BeSS. We demonstrate through extensive experiments that BeSS is efficient and stable for high-dimensional data, and can solve best subset problems with n in the 1000s and p in the 10000s in just seconds on a single personal computer.
The article is organized as follows. In Section 2, we provide a general primal-dual formulation of the best subset problem that includes the linear, logistic and CoxPH models as special cases. Section 3 presents the PDAS algorithms and related technical details. Numerical experiments on extensive simulated and real datasets are conducted in Section 4. We conclude with a short discussion in Section 5.

Primal-dual formulation
The best subset selection problem with subset size k is given by the following optimization problem:

$$\min_{\beta \in \mathbb{R}^p} \; l(\beta) \quad \text{subject to} \quad \|\beta\|_0 \le k, \qquad (1)$$

where $l(\beta)$ is a convex loss function of the model parameters $\beta \in \mathbb{R}^p$ and $k$ is a positive integer. The $L_0$ norm $\|\beta\|_0 = \sum_{j=1}^p |\beta_j|^0 = \sum_{j=1}^p 1_{\{\beta_j \neq 0\}}$ counts the number of nonzeros in $\beta$. It is known that problem (1) admits non-unique local optimal solutions, among which the coordinate-wise minimizers possess promising properties. For a coordinate-wise minimizer $\beta$, denote the vectors of gradient and Hessian diagonal by

$$g = \nabla l(\beta), \qquad h = \operatorname{diag}\{\nabla^2 l(\beta)\}, \qquad (2)$$

respectively. For each coordinate $j = 1, \ldots, p$, write $l_j(t) = l(\beta_1, \ldots, \beta_{j-1}, t, \beta_{j+1}, \ldots, \beta_p)$ while fixing the other coordinates. Then the local quadratic approximation of $l_j(t)$ around $\beta_j$ is given by

$$l_j^Q(t) = l(\beta) + g_j (t - \beta_j) + \tfrac{1}{2} h_j (t - \beta_j)^2, \qquad (3)$$

which gives rise to an important quantity $\gamma_j$ of the following scaled gradient form:

$$\gamma_j = - g_j / h_j. \qquad (4)$$

Minimizing the objective function $l_j^Q(t)$ yields $t_j^* = \beta_j + \gamma_j$ for $j = 1, \ldots, p$. The constraint in (1) says that $p - k$ components of $\{t_j^*, j = 1, \ldots, p\}$ must be enforced to be zero. To determine which $p - k$ components, we consider the sacrifices in $l_j^Q(t)$ when switching each $t_j^*$ from $\beta_j + \gamma_j$ to 0, which are given by

$$\Delta_j = l_j^Q(0) - l_j^Q(\beta_j + \gamma_j) = \tfrac{1}{2} h_j (\beta_j + \gamma_j)^2. \qquad (5)$$

Among all the candidates, we may enforce those $t_j^*$'s to zero that contribute the least total sacrifice to the overall loss. To realize this, let $\Delta_{[1]} \ge \cdots \ge \Delta_{[p]}$ denote the decreasing rearrangement of $\Delta_j$ for $j = 1, \ldots, p$, and truncate the ordered sacrifice vector at position $k$. Combining the analytical result in (3), we obtain

$$t_j^* = \begin{cases} \beta_j + \gamma_j, & \text{if } \Delta_j \ge \Delta_{[k]}, \\ 0, & \text{otherwise,} \end{cases} \qquad j = 1, \ldots, p. \qquad (6)$$

In (6), we treat $\beta = (\beta_1, \ldots, \beta_p)$ as primal variables, $\gamma = (\gamma_1, \ldots, \gamma_p)$ as dual variables, and $\Delta = (\Delta_1, \ldots, \Delta_p)$ as reference sacrifices. Next we provide their explicit expressions for three important statistical models.
Case 1: Linear regression. Consider the LM $y = X\beta + \varepsilon$ with design matrix $X \in \mathbb{R}^{n \times p}$ and i.i.d. errors. Here $X$ and $y$ are standardized such that the intercept term is removed from the model and each column of $X$ has $\sqrt{n}$ norm. Take the loss function $l(\beta) = \frac{1}{2n} \|y - X\beta\|^2$. For $j = 1, \ldots, p$, it is easy to obtain

$$g_j = -\frac{1}{n} X_{(j)}^\top (y - X\beta), \qquad h_j = \frac{1}{n} X_{(j)}^\top X_{(j)} = 1,$$

where $X_{(j)}$ denotes the $j$th column of $X$, so that

$$\gamma_j = \frac{1}{n} X_{(j)}^\top (y - X\beta), \qquad \Delta_j = \tfrac{1}{2} (\beta_j + \gamma_j)^2.$$

Case 2: Logistic regression. Consider the GLM $\log\{p(x)/(1 - p(x))\} = \beta_0 + x^\top \beta$, $x \in \mathbb{R}^p$, with $p(x) = \mathbb{P}(Y = 1 \mid x)$. Given the data $(x_i, y_i)_{i=1}^n$ with binary responses $y_i \in \{0, 1\}$, the negative log-likelihood function is given by

$$l(\beta_0, \beta) = -\sum_{i=1}^n \left\{ y_i (\beta_0 + x_i^\top \beta) - \log\left(1 + \exp(\beta_0 + x_i^\top \beta)\right) \right\}.$$

We give only the primal-dual quantities for $\beta \in \mathbb{R}^p$ according to the $L_0$ constraint in (1), while leaving $\beta_0$ to be estimated by the unconstrained maximum likelihood method. For $j = 1, \ldots, p$,

$$g_j = -\sum_{i=1}^n x_{ij} (y_i - p_i), \qquad h_j = \sum_{i=1}^n x_{ij}^2\, p_i (1 - p_i),$$

where $p_i = \exp(\beta_0 + x_i^\top \beta)/\{1 + \exp(\beta_0 + x_i^\top \beta)\}$ denotes the $i$th predicted probability. Then $\gamma_j = -g_j/h_j$ and $\Delta_j = \tfrac{1}{2} h_j (\beta_j + \gamma_j)^2$.

Case 3: CoxPH regression. Consider the CoxPH model $\lambda(t \mid x) = \lambda_0(t) \exp(x^\top \beta)$, $x \in \mathbb{R}^p$, with an unspecified baseline hazard $\lambda_0(t)$. Given the data $\{(T_i, \delta_i, x_i): i = 1, \ldots, n\}$ with observed survival times $T_i$ and censoring indicators $\delta_i$, by the method of partial likelihood (Cox 1972), the parameter $\beta$ can be obtained by minimizing the following convex loss:

$$l(\beta) = -\sum_{i: \delta_i = 1} \left\{ x_i^\top \beta - \log\Big( \sum_{i': T_{i'} \ge T_i} \exp(x_{i'}^\top \beta) \Big) \right\}.$$

By writing $\omega_{i,i'} = \exp(x_{i'}^\top \beta) / \sum_{i'': T_{i''} \ge T_i} \exp(x_{i''}^\top \beta)$, it can be verified that

$$g_j = -\sum_{i: \delta_i = 1} \Big( x_{ij} - \sum_{i': T_{i'} \ge T_i} \omega_{i,i'}\, x_{i'j} \Big), \qquad h_j = \sum_{i: \delta_i = 1} \Big\{ \sum_{i': T_{i'} \ge T_i} \omega_{i,i'}\, x_{i'j}^2 - \Big( \sum_{i': T_{i'} \ge T_i} \omega_{i,i'}\, x_{i'j} \Big)^2 \Big\},$$

so that $\gamma_j = -g_j/h_j$ and $\Delta_j = \tfrac{1}{2} h_j (\beta_j + \gamma_j)^2$ for $j = 1, \ldots, p$.
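As a concrete illustration of these quantities, the short R sketch below computes g and h for the logistic case and then forms the dual variables, sacrifices and size-k active set of (4)-(6); the objects X, y, b0, beta and k are placeholders for the design matrix, binary response, intercept, current coefficients and subset size.

# Gradient and Hessian diagonal for the logistic loss (Case 2), followed by
# the dual variables, sacrifices and size-k active set of (4)-(6).
logistic_gh <- function(X, y, b0, beta) {
  eta  <- drop(b0 + X %*% beta)
  prob <- 1 / (1 + exp(-eta))                   # predicted probabilities p_i
  g <- drop(-crossprod(X, y - prob))            # g_j = -sum_i x_ij (y_i - p_i)
  h <- drop(crossprod(X^2, prob * (1 - prob)))  # h_j = sum_i x_ij^2 p_i (1 - p_i)
  list(g = g, h = h)
}

top_k_active <- function(beta, g, h, k) {
  gamma <- -g / h                               # dual variables, (4)
  Delta <- 0.5 * h * (beta + gamma)^2           # sacrifices, (5)
  sort(order(Delta, decreasing = TRUE)[seq_len(k)])  # active set of size k, (6)
}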

Active set algorithm

The characterization (6) suggests the following conditions for a coordinate-wise minimizer β with active set A = {j : Δ_j ≥ Δ_[k]} and inactive set I = A^c: (C1) β_j = 0 for j ∈ I, and β restricted to A minimizes l(β) with the inactive coordinates fixed at zero; (C2) γ_j = 0 for j ∈ A, and γ_j = −g_j/h_j for j ∈ I; (C3) Δ_j ≥ Δ_j' for any j ∈ A and j' ∈ I. By (C1) and (C2), the primal variables β_j's and the dual variables γ_j's have complementary supports. (C3) can be viewed as a local stationarity condition. These three conditions lay the foundation for the primal-dual active set algorithm we develop in this section.
Let A be a candidate active set and I = A^c. By (C1), we may estimate the k nonzero primal variables by standard convex optimization:

$$\hat\beta = \arg\min_{\beta_I = 0} \; l(\beta).$$

Given $\hat\beta$, the g and h vectors in (2) can be computed, with their explicit expressions derived for the linear, logistic and CoxPH models in the previous section. The γ and Δ vectors are then readily obtainable by (4), (5) and (C2). We may then check whether (C3) is satisfied; otherwise, we update the active and inactive sets by

$$A_{\text{new}} = \{j: \Delta_j \ge \Delta_{[k]}\}, \qquad I_{\text{new}} = A_{\text{new}}^c.$$

This leads to the iterative algorithm shown in Algorithm 1.
Algorithm 1 (PDAS):
1. Specify the cardinality k of the active set and the maximum number of iterations m_max. Initialize A^0 to be a random k-subset of {1, . . . , p} and I^0 = (A^0)^c.
2. For m = 1, 2, . . . , m_max: estimate β^m by minimizing l(β) with the coordinates in I^{m−1} fixed at zero; compute g, h, γ and Δ at β^m; update A^m = {j : Δ_j ≥ Δ_[k]} and I^m = (A^m)^c; stop if A^m = A^{m−1}.
3. Output {A^m, β^m}.
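A minimal, self-contained R sketch of Algorithm 1 for the linear model is given below; it is written purely for illustration (a plain R loop rather than the optimized C++ routine in BeSS) and assumes the columns of X have been scaled to have √n norm, so that h_j = 1 and γ_j = −g_j.

# Minimal PDAS iteration for the linear model (illustration only).
pdas_lm <- function(X, y, k, m_max = 20) {
  n <- nrow(X); p <- ncol(X)
  A <- sort(sample.int(p, k))                        # step 1: random initial active set
  beta <- numeric(p)
  for (m in seq_len(m_max)) {                        # step 2
    beta <- numeric(p)
    beta[A] <- qr.solve(X[, A, drop = FALSE], y)     # least squares on the active set
    gamma <- drop(crossprod(X, y - X %*% beta)) / n  # dual variables (h_j = 1)
    Delta <- 0.5 * (beta + gamma)^2                  # sacrifices
    A_new <- sort(order(Delta, decreasing = TRUE)[seq_len(k)])  # updated active set
    if (identical(A_new, A)) break                   # active set unchanged: stop
    A <- A_new
  }
  list(active = A, beta = beta)
}

For example, pdas_lm(X, y, k = 5) returns the selected active set and the corresponding coefficient estimates for a subset of size 5.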
Algorithm 2 (SPDAS):
1. Specify the maximum size k_max of the active set, and initialize A^0 = ∅.
2. For k = 1, . . . , k_max: run Algorithm 1 with cardinality k, initialized with the active set obtained in the previous step, and record the loss and information criterion of the fitted sub-model.
3. Output the sub-model that minimizes the chosen criterion.
Remark 1. The proposed PDAS algorithm is based on the primal-dual active set strategy first developed by Ito and Kunisch (2013), but differs from their original algorithm in two main respects. First, our PDAS algorithm is derived from the quadratic argument (3) and involves the second-order partial derivatives (i.e., the Hessian diagonal h). Second, our algorithm extends the original linear model setting to the general setting with convex loss functions.

Determination of optimal k
The subset size k is usually unknown in practice, so one has to determine it in a data-driven manner. A heuristic way is to use cross-validation to achieve the best prediction performance. Yet cross-validation is time consuming, especially for high-dimensional data. An alternative is to run the PDAS algorithm from small to large values of k and then identify an optimal choice according to some criterion, e.g., the Akaike information criterion (AIC, Akaike 1974), the Bayesian information criterion (BIC, Schwarz 1978), or the extended BIC (EBIC, Chen and Chen 2008, 2012) for small-n-large-p scenarios. This leads to the sequential PDAS algorithm shown in Algorithm 2.
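To illustrate the sequential strategy, the sketch below wraps the pdas_lm routine from the earlier sketch over k = 1, . . . , k_max and picks k by BIC; the Gaussian BIC formula used here is only meant to indicate the idea, not the exact criteria implemented in BeSS.

# Sequential search over k, choosing the subset size by BIC (illustration only).
spdas_lm <- function(X, y, k_max) {
  n <- nrow(X)
  fits <- lapply(seq_len(k_max), function(k) pdas_lm(X, y, k))
  bic <- sapply(seq_len(k_max), function(k) {
    rss <- sum((y - X %*% fits[[k]]$beta)^2)
    n * log(rss / n) + k * log(n)              # Gaussian BIC with k active terms
  })
  fits[[which.min(bic)]]
}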
To alleviate the computational burden of determining k as in SPDAS, we provide an alternative method based on the golden section search algorithm. We begin by plotting the loss function l(β) as a function of k for simulated data from a linear model with standard Gaussian errors. The true coefficient vector is β = (3, 1.5, 0, 0, −2, 0, 0, 0, −1, 0, . . . , 0) and the design matrix X is generated as in Section 4.1 with ρ = 0.2. From Figure 1, it can be seen that the slope of the loss curve goes from steep to flat, and an 'elbow' exists near the true size of the active set, i.e., k = 4.
The solution path for the same data is presented at the bottom of Figure 1 for better visualization of the relationship between the loss function and the coefficient estimates. When a true active predictor enters the model, the loss function drops dramatically and the predictors already in the model adjust their estimates towards the true values. Once all the active predictors are included in the model, their estimates change little as k grows.
Motivated by this phenomenon, we develop a search algorithm based on the golden section method to determine the location of such an elbow in the loss function. In this way, we avoid running the PDAS algorithm over a whole sequential list of k values. The golden section primal-dual active set (GPDAS) algorithm is summarized in Algorithm 3.
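To make the search concrete, a rough golden-section-style sketch is given below; it illustrates the general strategy only, and the exact splitting and elbow rules of GPDAS (Algorithm 3) differ in their details.

# Rough golden-section-style search for the 'elbow' of the loss as a function
# of k; 'loss_at_k' is any function returning the PDAS loss at subset size k,
# e.g., loss_at_k <- function(k) sum((y - X %*% pdas_lm(X, y, k)$beta)^2).
gsection_k <- function(loss_at_k, k_max, eta = 0.01) {
  kL <- 1; kR <- k_max
  while (kR - kL > 1) {
    kM <- kL + floor(0.618 * (kR - kL))          # golden section point
    gain_left  <- loss_at_k(kL) - loss_at_k(kM)  # loss drop up to kM
    gain_right <- loss_at_k(kM) - loss_at_k(kR)  # further drop beyond kM
    if (gain_right < eta * abs(gain_left)) kR <- kM else kL <- kM
  }
  kR
}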

Computational details
The proposed PDAS, SPDAS and GPDAS algorithms are much faster than the existing methods reviewed in Section 1. Exhaustive methods such as leaps and bestglm essentially examine ∑_{k=1}^{k_max} C(p, k) sub-models in order to search for the best subset with size no more than k_max, which is infeasible even for moderate k_max. That is why greedy methods (e.g., glmulti) and relaxed methods (e.g., glmnet) have become popular. Our proposed algorithms belong to the greedy methods, and their computational complexity is discussed below.
In general, consider one iteration in step (2) of the PDAS algorithm with a pre-specified k. Denote by N_l the computational complexity for solving β on the active set, and by N_g and N_h the computational complexities for calculating g and h, respectively. Given g and h, the calculation of γ and Δ costs only O(p) additional flops. The total number of iterations of the PDAS algorithm depends on the signal-to-noise ratio, the dimensionality p, and the sparsity level k; the algorithm usually converges in finitely many steps (otherwise it is capped by m_max). Denote by N_P the complexity of each run of the PDAS algorithm; then the total complexities of the SPDAS and GPDAS algorithms are O(k_max × N_P) and O(log(k_max) × N_P), respectively.
Case 1: Linear regression. Since h_j = 1 for all j, N_h = 0. Computing each component of g requires an inner product of length n, i.e., O(n) flops. For the least squares problem on the active set, we use a Cholesky factorization to obtain the estimate, which leads to N_l = O(max(nk², k³)). Thus the total cost of one iteration in step (2) is O(max(nk², k³, n(p − k))), and the overall cost of the PDAS algorithm is of the same order since the number of iterations is typically finite.
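For reference, the Cholesky-based solve on the active set amounts to a few lines of R, where XA denotes the n × k submatrix of active columns; forming X_A'X_A costs O(nk²), the factorization O(k³) and the triangular solves O(k²).

# Least squares on the active set via Cholesky factorization.
ls_active <- function(XA, y) {
  R <- chol(crossprod(XA))                       # X_A' X_A = R' R
  backsolve(R, forwardsolve(t(R), crossprod(XA, y)))
}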
Algorithm 3 (GPDAS):
1. Specify the maximum number of iterations m_max, the maximum size k_max of the active set, and the tolerance η ∈ (0, 1). Initialize k_L = 1 and k_R = k_max.
2. For m = 1, 2, . . . , m_max: determine the golden section point k_M of the interval [k_L, k_R] and run the PDAS algorithm at k_L, k_M and k_R; if the loss stops decreasing appreciably at k_M (judged with tolerance η), stop and denote k_M as an elbow point; otherwise update k_L or k_R accordingly and go ahead.
In particular, if the true coefficient vector is sparse with k ≪ p and n = O(log(p)), the cost of the PDAS algorithm is O(np), linear in the dimension p. When k is unknown, we can choose an appropriate k_max value, e.g., k_max = n/log(n), to speed up the SPDAS and GPDAS algorithms; their costs then become O(n²p/log(n)) and O(np log(n/log(n))), respectively. These rates are comparable with the sure independence screening procedure (Fan and Lv 2008) for handling ultrahigh-dimensional data. In fact, even if the true coefficient vector is not sparse, we could use a conjugate gradient algorithm (Golub and Van Loan 2012) with a preconditioning matrix to achieve a similar computational rate.
Case 2: Logistic regression. It costs O(p) flops to compute each predicted probability p_i. Thus N_g = O(np) and N_h = O(np). We use iteratively reweighted least squares (IRLS) for parameter estimation on the active set. The complexity of each IRLS step is the same as that of least squares, so N_l = O(N_I max(nk², k³)), with N_I denoting the finite number of IRLS iterations. The total cost of one iteration in step (2) is O(max(np², nk²N_I, k³N_I)).
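A bare-bones IRLS loop for the logistic fit on the active set might look as follows; this is a sketch only (the package uses its own C++ routine), and XA is assumed to contain the active columns together with a column of ones for the intercept.

# Bare-bones IRLS for logistic regression on the active set (illustration only).
irls_logistic <- function(XA, y, n_iter = 25, tol = 1e-8) {
  beta <- numeric(ncol(XA))
  for (it in seq_len(n_iter)) {
    eta  <- drop(XA %*% beta)
    prob <- 1 / (1 + exp(-eta))
    w <- pmax(prob * (1 - prob), 1e-6)            # IRLS weights (guarded away from 0)
    z <- eta + (y - prob) / w                     # working response
    beta_new <- lm.wfit(XA, z, w)$coefficients    # weighted least squares step
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}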
Case 3: CoxPH regression. It costs O(np) flops to compute the ω_{i,i'}'s. Assuming the censoring rate is c, N_g = O(n³p(1 − c)) and N_h = O(n³p(1 − c)). Like the coxph function from the survival package (Therneau 2015), we adopt the standard Newton-Raphson algorithm for maximum partial likelihood estimation on the active set. Its difficulty lies in computing the inverse of the Hessian matrix, which is full and dense. The Hessian matrix has k² entries and each entry requires O(n³k(1 − c)) flops, while the matrix inversion costs O(k³) via Gauss-Jordan elimination or Cholesky decomposition. Hence each Newton-Raphson update requires O(max(n³k³(1 − c), k³)) flops. We may speed up the algorithm by replacing the Hessian matrix with its diagonal, which reduces the cost per update to O(max(n³k²(1 − c), k³)). Denoting by N_nr the number of Newton-Raphson iterations, N_l = O(N_nr max(n³k²(1 − c), k³)) and the total cost of one iteration in step (2) is O(max(n³p²(1 − c), n³k²(1 − c)N_nr, k³N_nr)).
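For completeness, the gradient and Hessian diagonal of the negative log partial likelihood can be computed naively as below (ignoring tied event times); this illustrates the quantities ω_{i,i'}, g and h discussed above rather than the optimized routine used in the package.

# Naive gradient and Hessian diagonal for the CoxPH loss (Case 3, no ties).
cox_gh <- function(X, time, status, beta) {
  w <- exp(drop(X %*% beta))
  g <- numeric(ncol(X)); h <- numeric(ncol(X))
  for (i in which(status == 1)) {                     # loop over observed events
    risk <- which(time >= time[i])                    # risk set at T_i
    om <- w[risk] / sum(w[risk])                      # omega_{i, i'}
    xbar  <- colSums(om * X[risk, , drop = FALSE])    # weighted mean of x over risk set
    x2bar <- colSums(om * X[risk, , drop = FALSE]^2)  # weighted mean of x^2
    g <- g - (X[i, ] - xbar)
    h <- h + (x2bar - xbar^2)
  }
  list(g = g, h = h)
}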

R package
We have implemented the active set algorithms described above in an R package called BeSS (best subset selection), which is publicly available from CRAN at https://CRAN.R-project.org/package=BeSS. The core of the package is implemented in C++ using an Rcpp interface (Eddelbuettel and François 2011), with memory usage optimized via sparse matrix output, and it can be called from R through a user-friendly interface.
The package contains two main functions, bess.one and bess, for solving the best subset selection problem with or without a specified k. In bess, two options are provided to determine the optimal k: one is based on the SPDAS algorithm with criteria including AIC, BIC and EBIC; the other is based on the GPDAS algorithm. The plot method for 'bess' objects generates plots of the loss function of the best sub-model for each candidate k, together with solution paths for each predictor. We also include predict methods for 'bess' and 'bess.one' objects to make predictions on new data.
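As a quick orientation (full examples follow in Section 4.3), typical calls look roughly as follows; the argument names shown here (s for the subset size, family for the model type) are indicative and should be checked against the package documentation.

R> library("BeSS")
R> fit.one <- bess.one(x, y, s = 10, family = "gaussian")   # fixed subset size (argument 's' assumed)
R> fit.seq <- bess(x, y, family = "gaussian", method = "sequential")   # SPDAS
R> fit.gs  <- bess(x, y, family = "gaussian", method = "gsection")     # GPDAS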

Numerical examples
In this section we compare the performance of our new BeSS package to other well-known packages for best subset selection: leaps, bestglm and glmulti. We also include glmnet as an approximate subset selection method and use the default cross-validation method to determine an optimal tuning parameter. All parameters use the default values of the corresponding main functions in those packages unless otherwise stated. In presenting the results of BeSS, bess.seq represents bess with argument method = "sequential" and bess.gs represents bess with argument method = "gsection", two different ways to determine the optimal parameter k. In bess.seq, we use AIC for examples with n ≥ p and EBIC for examples with n < p. We choose k_max = min(n/2, p) for linear models and k_max = min(n/log(n), p) for logistic and CoxPH models.
All the R codes are demonstrated in Section 4.3. All computations are carried out on a 64-bit Intel machine with a single 3.30 GHz CPU and 4 GB of RAM.

Simulation data
We demonstrate the practical application of our new BeSS package on synthetic data under both low- and high-dimensional settings. For the low-dimensional data, BeSS has performance comparable with other state-of-the-art methods. For the high-dimensional data, most state-of-the-art methods become incapable of dealing with them, while BeSS still performs fairly well. For instance, BeSS is scalable enough to identify the best sub-model among all candidates in seconds or a few minutes even when the dimension is p = 10000.
We compare the performance of the different methods in three aspects. The first is the run time in seconds (Time). The second is the selection performance in terms of the true positive (TP) and false positive (FP) numbers, defined as the numbers of truly relevant and truly irrelevant variables among the selected predictors. The third is the predictive performance on held-out test data of size 1000. For linear regression, we use the relative mean squared error (MSE), defined by $\|X\hat\beta - X\beta^*\|^2 / \|X\beta^*\|^2$. For logistic regression, we calculate the classification accuracy as the proportion of observations correctly classified. For CoxPH regression, we compute the median survival time on the test data and then derive the area under the receiver operating characteristic curve (AUC) using the nearest neighbor estimation method of Heagerty, Lumley, and Pepe (2000).
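For concreteness, the linear-model and logistic metrics can be computed as below, where X.test, y.test, beta.hat, beta.star and prob.test are placeholder names for the held-out design matrix, test responses, estimated and true coefficients, and predicted probabilities.

R> mse.rel <- sum((X.test %*% beta.hat - X.test %*% beta.star)^2) /
+    sum((X.test %*% beta.star)^2)
R> acc <- mean(as.numeric(prob.test > 0.5) == y.test)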
We generate the design matrix X and the underlying coefficients β* as follows. The design matrix X is generated with X_{(j)} = Z_j + 0.5 × (Z_{j−1} + Z_{j+1}) for j = 1, . . . , p, where Z_0 = 0, Z_{p+1} = 0 and the {Z_j, j = 1, . . . , p} are i.i.d. random samples drawn from the standard Gaussian distribution and subsequently normalized to have √n norm. The true coefficient vector β* has q nonzero entries uniformly distributed in [b, B], with b and B specified below. In the simulation study, the sample size is fixed at n = 1000. For each scenario, 100 replications are conducted.
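The design described above can be simulated directly as in the following sketch, where n and p are illustrative values; the second line normalizes each Z_j to have √n norm as stated in the text.

R> n <- 1000; p <- 100
R> Z <- matrix(rnorm(n * (p + 2)), n, p + 2)       # columns stand for Z_0, ..., Z_{p+1}
R> Z <- sweep(Z, 2, sqrt(colSums(Z^2) / n), "/")   # normalize each Z_j to sqrt(n) norm
R> Z[, 1] <- 0; Z[, p + 2] <- 0                    # Z_0 = Z_{p+1} = 0
R> X <- Z[, 2:(p + 1)] + 0.5 * (Z[, 1:p] + Z[, 3:(p + 2)])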
Case 1: Linear regression. For each X and β*, we generate the response vector y = Xβ* + σε with ε ∼ N(0, 1). We set b = 5σ√(2 log(p)/n), B = 100b and σ = 3. Different choices of (p, q) are taken to cover both the low-dimensional cases (p = 20, 30, or 40 with q = 4) and the high-dimensional cases (p = 100, 1000, or 10000 with q = 40). For glmulti, we only present the results for p = 20 and p = 30 since it can only deal with at most 32 predictors. Note that, since glmulti does not allow setting the random seed, its results are not exactly reproducible and are subject to small variations. Since leaps and bestglm cannot deal with the high-dimensional cases, we only report the results of glmnet, bess.seq and bess.gs there. The results are summarized in Table 1. In the low-dimensional cases, the performances of all best subset selection methods are comparable in terms of prediction accuracy and selection consistency. However, the regularization method glmnet has much higher MSE and lower FP, which suggests that the LASSO incurs bias in the coefficient estimation. In terms of computational time, both bess.seq and bess.gs are comparable with glmnet and require far less run time than the other state-of-the-art methods. Unlike leaps, bestglm and glmulti, the run times of bess.seq and bess.gs remain fairly stable across different dimensions.
In the high-dimensional cases, both bess.seq and bess.gs work quite well and have similar performance in prediction and variable selection. Furthermore, their performance improves as p and q increase (from left to right in Table 1). On the other hand, glmnet has higher FP as p increases. In particular, when p = 10000 and only 40 nonzero coefficients are involved, the average TP equals 40 and the average FP is less than 3.06, whereas the average FP of glmnet increases to 30. As for computation, the run times of both bess.seq and bess.gs appear to grow linearly with p, but bess.gs offers speedups by factors of 2 up to 10 and more.

Case 2: Logistic regression. For each x and β*, the binary response is generated by y = Bernoulli(P(Y = 1)), where P(Y = 1) = exp(x^T β*)/(1 + exp(x^T β*)). The range of nonzero coefficients is set by b = 10√(2 log(p)/n) and B = 5b. Different choices of p are taken to cover both the low-dimensional cases (p = 8, 10, or 12) and the high-dimensional cases (p = 100, 1000, or 10000). The number of true nonzero coefficients is q = 4 in the low-dimensional cases and q = 20 in the high-dimensional cases. Since bestglm is based on complete enumeration, it can be used for the low-dimensional cases but becomes computationally infeasible for the high-dimensional ones.
The simulation results are summarized in Table 2. When p is small, both bess.seq and bess.gs have performance comparable with bestglm, glmulti and glmnet, but are considerably faster than bestglm and glmulti. In the high-dimensional cases, all three methods perform very well in terms of accuracy and TP, yet both bess.seq and bess.gs have much smaller FP than glmnet. Among them, the run time of bess.gs is around a quarter of that of bess.seq and is similar to that of glmnet.

Case 3: CoxPH regression.
For each x and β*, we generate data from the CoxPH model with hazard rate λ(t|x) = exp(x^T β*). The ranges of the nonzero coefficients are the same as in the logistic regression, i.e., b = 10√(2 log(p)/n) and B = 5b. Different choices of p are taken to cover both the low-dimensional cases (p = 8, 10, or 12) and the high-dimensional cases (p = 100, 1000, or 10000). The number of true nonzero coefficients is q = 4 in the low-dimensional cases and q = 20 in the high-dimensional cases. Since glmulti cannot handle more than 32 predictors, we only report its results for the low-dimensional cases.
The simulation results are summarized in Table 3. Our findings about bess.seq and bess.gs are similar to those for the logistic regression.

Real data
We also evaluate the performance of the BeSS package in modeling several real data sets. Table 4 lists these instances and their descriptions. All datasets are saved as R data objects and available online with this publication.
We randomly split the data into a training set containing two-thirds of the observations and a test set containing the remaining observations. Different best subset selection methods are then used to identify the best sub-model. For each method, the run time in seconds (Time) and the size of the selected model (MS) are recorded. We also measure the predictive performance on the test data using the metrics described in Section 4.1. For reliable evaluation, this procedure is replicated 100 times.
The modeling results are displayed in Table 5. Again, in the low-dimensional cases, bess has performance comparable with the state-of-the-art algorithms (the branch-and-bound algorithm for linear models, and the complete enumeration and genetic algorithms for GLMs). Moreover, bess.gs has run time comparable with glmnet and is considerably faster than bess.seq, especially in the high-dimensional cases.

Code demonstration
We demonstrate how to use the BeSS package on synthetic data as discussed in Section 3.1 and on a real dataset from Section 4.2. First, load BeSS and generate data with the gen.data function. We may then call the bess.one function to solve the best subset selection problem with a specified cardinality, and print or summary the fitted 'bess.one' object. While the print method gives a brief summary of the fitted model, the summary method presents a much more detailed description.
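The generating and fitting calls themselves are not reproduced here; under one plausible reading of the interface (the arguments K, rho and sigma of gen.data, the component data$y, and the argument s of bess.one are assumptions to be checked against the package documentation, and n, p are illustrative values), they take roughly the following form.

R> library("BeSS")
R> set.seed(123)
R> data <- gen.data(n = 200, p = 20, family = "gaussian", K = 4, rho = 0.2,
+    sigma = 1)
R> fit.one <- bess.one(data$x, data$y, s = 4, family = "gaussian")
R> print(fit.one)
R> summary(fit.one)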

The estimated coefficients of the fitted model can be extracted using the coef function, which provides sparse output controlled by the argument sparse = TRUE. It is recommended to output a non-sparse vector when bess.one is used, and a sparse matrix when bess is used.

Table 5: Results for the real data sets. Time stands for run time (CPU seconds); MS stands for the size of the selected model; PE stands for mean prediction error in the linear model; Acc stands for classification accuracy in the logistic regression model; AUC stands for the integrated time-dependent area under the curve in the CoxPH regression model.
To make predictions on new data, the predict function can be used as follows.
R> pred.one <- predict(fit.one, newdata = data$x)

To extract the selected best model, we provide an lm, glm, or coxph type of object named bestmodel in the fitted 'bess.one' object. Users can print, summary or predict this bestmodel object just as they would with classical regression fits. This is helpful for statistical analysts who are familiar with the lm, glm, or coxph functions. In practice, when the best subset size is unknown, we have to determine the optimal sub-model size. The function bess provides two options: method = "sequential" corresponds to the SPDAS algorithm, and method = "gsection" corresponds to the GPDAS algorithm. Next we illustrate the usage of bess on the trim32 data. We first load the data into the environment and note that it has 18975 variables, a much larger number than the sample size 120.

Below is an example of running bess with argument method = "sequential", epsilon = 0 and the other arguments at their default values. We use the summary function to give a summary of the fitted 'bess' object.
R> fit.seq <- bess(X, Y, method = "sequential", epsilon = 0)
R> summary(fit.seq)

As with bess.one, the bess function outputs an 'lm'-type object bestmodel associated with the selected best model. Here the bestmodel component corresponds to the largest fitted model, since we did not use any early stopping rule (argument epsilon = 0). Alternatively, we might use criteria such as AIC to select the best model among the sequential list of candidate models. As shown above, the output of the bess function includes the AIC, BIC and EBIC values for best subset selection. Since the trim32 data is high dimensional, we opt to use the EBIC criterion to determine the optimal model size. We then run the coef function to extract the coefficients in the 'bess' object and output the nonzero coefficients of the selected model. We can also run the predict function on given newdata; the argument type specifies which criterion is used to select the best fitted model.
R> pred.seq <- predict(fit.seq, newdata = data$x, type = "EBIC")

The plot method provides the loss function plot for the sub-models over different k values, as well as solution paths for each predictor. It also adds a vertical dashed line to indicate the optimal k value as determined by EBIC. Figure 2 shows the result of the following R code.
R> plot(fit.seq, type = "both", breaks = TRUE, K = K.opt.ebic)

Next we call the bess function with argument method = "gsection" to perform the GPDAS algorithm. At each iteration, it prints the split information.
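The call producing fit.gs is not shown above; based on the interface used earlier, it takes roughly the following form (the exact arguments are an assumption here).

R> fit.gs <- bess(X, Y, method = "gsection")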
R> beta <- coef(fit.gs, sparse = TRUE)
R> class(beta)

Based on fit.gs, we can predict on new data via the predict function as follows.

Discussion
In this paper, we introduce the primal-dual active set (PDAS) algorithm for solving the best subset selection problem under a general convex loss. The PDAS algorithm identifies the best sub-model with a pre-specified model size via a primal-dual formulation of feasible solutions. To determine the best sub-model over different model sizes, both a sequential search and a golden section search are proposed, i.e., the SPDAS and GPDAS algorithms. We find that the GPDAS algorithm is especially efficient and accurate in selecting variables for high-dimensional and sparse data.
The proposed algorithms are implemented in C++ through the new BeSS package in the R statistical environment. The BeSS package provides R users with a new and flexible way to carry out best subset selection for sparse LM, GLM and CoxPH models. It allows us to identify the best sub-model efficiently (usually in seconds or a few minutes) even when the number of predictors is very large, say p ≈ 10000, on a standard personal computer. In both the simulation and real data examples, the BeSS package was shown to be highly efficient compared with other state-of-the-art methods.