Variable selection and estimation for the accelerated failure time model via the seamless-L0 penalty

Survival data with high-dimensional covariates are commonly collected in medical studies and other fields. In this work, we propose a seamless-L0 (SELO) penalized method for the accelerated failure time (AFT) model in the high-dimensional setting. Specifically, we apply the SELO penalty to perform variable selection and estimation under this model. Under appropriate conditions, we show that SELO selects a model whose dimension is comparable to that of the underlying model, and prove that the proposed estimator is asymptotically normal. Simulation results demonstrate that the SELO procedure outperforms several existing procedures, and a real data analysis shows that SELO selects variables more accurately.


Introduction
Analyzing high-dimensional survival data has become an important topic in statistics, and identifying covariates with good predictive power for survival is a fundamental step. For variable selection, penalized least squares is an attractive approach: by penalizing the regression coefficients, it performs variable selection and estimation simultaneously and yields more stable predictions.
As a useful alternative to the Cox model [2], the AFT model [10] is based on linear regression and has a more intuitive form. The AFT model with an unspecified error distribution has been widely studied for right-censored data, and two approaches have attracted particular attention. One uses the Kaplan-Meier estimator to construct a least squares estimator; the other is the rank-based estimator, which is motivated by the score function of the partial likelihood. See, for example, [1,13,17]. To identify significant factors with predictive power, many techniques for linear regression models have been extended to the Cox regression and the AFT model. Penalized methods, which impose penalties on the regression coefficients, have drawn extensive attention: by balancing goodness of fit against model complexity, penalization reduces a complex model to a parsimonious one. Many such methods have been used in gene expression analysis with survival data; see, for example, [16,19]. Moreover, various penalization methods with consistent selection have been proposed, including the adaptive Lasso [19], the smoothly clipped absolute deviation (SCAD) [6], the minimax concave penalty (MCP) [20] and the bridge penalty. The bridge penalty has been shown to possess the oracle property in linear regression models with a divergent number of covariates. For the AFT model there is also a large literature (e.g., [7,9,22]). To name but a few, Huang et al. [7] considered regularization approaches for estimation in the AFT model with high-dimensional covariates based on Stute's weighted least squares method. Huang and Ma [8] considered variable selection for the AFT model with the bridge penalty. Wang and Song [18] applied the adaptive lasso to the AFT model. The AFT model has continued to attract attention in recent years. For example, Chai et al. [3] considered a set of low-dimensional covariates of main interest together with a set of high-dimensional covariates that may also affect survival under the AFT model. Choi and Choi [4] proposed a logistic-kernel smoothing procedure for the semiparametric AFT model with high-dimensional right-censored data. Li et al. [12] proposed a unified Expectation-Maximization approach combined with an L1-norm penalty to perform variable selection and parameter estimation simultaneously in the AFT model with right-censored survival data of moderate size. This article is motivated by the need to consider the seamless-L0 (SELO) penalty [5] under the AFT model; SELO is a smooth function that closely approximates the L0 penalty. Under appropriate conditions, we show that SELO selects a model whose dimension is comparable to that of the underlying model and prove that the proposed estimator is asymptotically normal. Monte Carlo simulations are conducted to evaluate the finite-sample performance of the proposed procedure, and the method is further demonstrated through an empirical analysis.
The rest of this paper is organized as follows. The SELO-penalized AFT model and the computational algorithm are introduced in Section 2. In Section 3 we establish the theoretical properties of variable selection for the high-dimensional sparse AFT model based on the seamless-L0 penalty: the root-n consistency and the asymptotic normality of the resulting estimator. A Monte Carlo simulation study examining the finite-sample performance of the proposed estimator is reported in Section 4. A real data example illustrating the proposed methodology is given in Section 5.

SELO estimation in the AFT model
Let T_i be the logarithm of the failure time and X_i the p-dimensional covariate vector forming the ith row of the n × p covariate matrix X. The AFT model assumes

T_i = α + X_i^T β + ε_i,  i = 1, . . . , n,  (2.1)

where α is the intercept, β ∈ R^p is an unknown vector of interest, and ε_i is the random error. When T_i is subject to right censoring, we can only observe (Y_i, δ_i, X_i), where Y_i = min(T_i, C_i), C_i is the logarithm of the censoring time, and δ_i = I{T_i ≤ C_i} is the censoring indicator. We assume that (Y_i, δ_i, X_i), i = 1, . . . , n, are independent and identically distributed.
Let F̂_n be the Kaplan-Meier estimator of the distribution function F of T. F̂_n can be written as

F̂_n(y) = Σ_{i=1}^n w_i I{Y_(i) ≤ y},

where the w_i's are the jumps in the Kaplan-Meier estimator, expressed as

w_1 = δ_(1)/n,  w_i = δ_(i)/(n − i + 1) ∏_{j=1}^{i−1} ((n − j)/(n − j + 1))^{δ_(j)},  i = 2, . . . , n.

The w_i's are also called the Kaplan-Meier weights; see, for example, Stute and Wang [14]. Here Y_(1) ≤ · · · ≤ Y_(n) are the order statistics of the Y_i's and δ_(1), . . . , δ_(n) are the associated censoring indicators. Similarly, let X_(1), . . . , X_(n) be the covariates associated with the ordered Y_i's. The weighted least squares (WLS) loss function is

L_n(α, β) = (1/2) Σ_{i=1}^n w_i (Y_(i) − α − X_(i)^T β)^2.  (2.2)

Penalized regression has been studied extensively. The LASSO is one of the most popular and widely studied L1 penalties, but its estimator can be inconsistent for model selection. The smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP) are two other popular penalties: SCAD is a continuous penalty whose estimator has the oracle property, and MCP also performs well in variable selection, with a consistent estimator. The L0 penalty directly penalizes the number of non-zero parameters, but its discontinuity makes it difficult to compute. The seamless-L0 (SELO) penalty, proposed by Dicker [5], was explicitly designed to mimic the L0 penalty and has been shown to possess good theoretical properties.
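As an illustration of the weights and the loss (2.2), the sketch below computes the Kaplan-Meier weights and the WLS loss in Python (the function names are ours, not the paper's):

```python
import numpy as np

def km_weights(y, delta):
    """Kaplan-Meier (Stute) weights for the observations ordered by y.

    y     : (n,) observed log-times Y_i = min(T_i, C_i)
    delta : (n,) censoring indicators delta_i = I{T_i <= C_i}
    Returns (w, order): weights w_(1), ..., w_(n) for y sorted ascending,
    and the sorting permutation.
    """
    order = np.argsort(y)
    d = delta[order].astype(float)
    n = len(y)
    i = np.arange(1, n + 1)
    # survival factors ((n - j) / (n - j + 1))^{delta_(j)}, j = 1, ..., n
    factors = ((n - i) / (n - i + 1.0)) ** d
    # cumulative product over j = 1, ..., i - 1 (empty product = 1 when i = 1)
    cumfac = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    return d / (n - i + 1.0) * cumfac, order

def wls_loss(alpha, beta, y, delta, X):
    """Stute's weighted least squares loss, as in (2.2)."""
    w, order = km_weights(y, delta)
    resid = y[order] - alpha - X[order] @ beta
    return 0.5 * np.sum(w * resid ** 2)
```

A quick sanity check: with no censoring (all delta_i = 1) the weights collapse to the uniform weights 1/n, so (2.2) reduces to ordinary least squares.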
We now describe variable selection for the AFT model via SELO. The SELO penalized objective function is

Q_n(α, β) = L_n(α, β) + Σ_{j=1}^p SELO(β_j),  (2.3)

where SELO(β_j) is defined as

SELO(β_j) = (λ / log 2) log(|β_j| / (|β_j| + τ) + 1),

and λ > 0 is a tuning parameter. When λ is large, SELO tends to shrink small estimates to zero. Throughout, λ is determined by cross-validation. It is easy to see that when τ is sufficiently small, SELO(β_j) ≈ λ I{β_j ≠ 0}, which is similar to the L0 penalty. To minimize (2.3), we utilize the coordinate descent algorithm [21], which has been widely used in penalized regression and optimizes an objective function over a single parameter at a time until convergence is reached. Dicker [5] described this algorithm for obtaining SELO estimators. The algorithm is formulated in terms of the tuning parameter λ. For a fixed value of λ, it can be implemented in the following steps.
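The SELO penalty of Dicker [5] has the form SELO(β) = (λ / log 2) log(|β| / (|β| + τ) + 1). A small numerical sketch showing that it approximates λ I{β ≠ 0} when τ is small:

```python
import numpy as np

def selo_penalty(beta, lam, tau):
    """Seamless-L0 penalty: a smooth surrogate for lam * I{beta != 0}."""
    b = np.abs(beta)
    return lam / np.log(2.0) * np.log(b / (b + tau) + 1.0)

lam, tau = 1.0, 1e-4
print(selo_penalty(0.0, lam, tau))  # exactly 0 at beta = 0
print(selo_penalty(2.0, lam, tau))  # close to lam for beta far from 0
```

For |β| much larger than τ the ratio |β|/(|β| + τ) is nearly 1, so the penalty is nearly λ; at β = 0 it is exactly 0. This is what makes SELO a seamless approximation of the L0 penalty.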
Step 1. Choose an initial value β^(0) (for example, the LASSO estimate) and set k = 0.
Step 2. For the k-th iteration, update each coordinate in turn: for j = 1, . . . , p, compute β_j^(k+1) by minimizing the objective function over β_j with the remaining coordinates held fixed at their current values.
Step 3. If ||β^(k+1) − β^(k)|| is below a prespecified tolerance, stop; otherwise increase k to k + 1 and go to Step 2.
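The steps above can be sketched as follows. This is a simplified illustration of our own, not the paper's implementation: the intercept is omitted, the Kaplan-Meier weights are assumed to have been absorbed into Xw and yw (rows multiplied by sqrt(w_i)), and each one-dimensional update uses a simple heuristic for the nonconvex penalty, comparing the objective at zero and at the unpenalized coordinate minimizer:

```python
import numpy as np

def selo(b, lam, tau):
    b = abs(b)
    return lam / np.log(2.0) * np.log(b / (b + tau) + 1.0)

def selo_cd(Xw, yw, lam, tau, max_iter=100, tol=1e-6):
    """Coordinate descent for (1/2)||yw - Xw beta||^2 + sum_j SELO(beta_j)."""
    n, p = Xw.shape
    beta = np.zeros(p)
    r = yw.copy()                       # current residual (beta starts at 0)
    col_ss = (Xw ** 2).sum(axis=0)      # ||x_j||^2 for each column
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            r_j = r + Xw[:, j] * beta[j]          # residual with beta_j removed
            b_ls = Xw[:, j] @ r_j / col_ss[j]     # unpenalized 1-D minimizer
            # 1-D objective up to a constant: quadratic around b_ls plus penalty
            obj = lambda b: 0.5 * col_ss[j] * (b - b_ls) ** 2 + selo(b, lam, tau)
            beta[j] = 0.0 if obj(0.0) <= obj(b_ls) else b_ls
            r = r_j - Xw[:, j] * beta[j]
        if np.max(np.abs(beta - beta_old)) < tol:   # Step 3: convergence check
            break
    return beta
```

On well-conditioned designs with strong signals this recovers the sparse support: coordinates whose least-squares update is small are set exactly to zero because the penalty saved by zeroing them exceeds the tiny loss increase.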

Theoretical properties
In this section, we establish the consistency and asymptotic normality of the WLS estimator via SELO under some regularity conditions. Following the notation of Stute [14,15], let H denote the distribution function of Y. Under the assumption of independence between T and C, 1 − H(y) = (1 − F(y))(1 − G(y)), where F and G are the distribution functions of T and C, respectively. Let τ_Y, τ_T and τ_C be the endpoints of the support of Y, T and C. Now introduce the sub-distribution functions

H̃^0(y) = P(Y ≤ y, δ = 0),  H̃^1(x, y) = P(X ≤ x, Y ≤ y, δ = 1).

Under random censoring the limiting variance becomes much more complicated. We assume a (1 + δ)-moment bound, with E| · |^{1+δ} < M for some M < ∞ and δ > 0.
Condition (A1) is the standard condition used for the proof of consistency in Stute [14]. Condition (A2) restricts the sizes of λ and τ. Condition (A3) bounds the eigenvalues of E(XX^T) and is used in Theorem 3.1. Conditions (A4) and (A5) are used in the proof of the asymptotic normality of the SELO estimators and are related to the Lindeberg condition of the Lindeberg-Feller CLT.
The proofs are given in the supplementary material.

Numerical results
Let T be generated from T = Xβ + ε, where ε ∼ N(0, σ^2) with σ = 0.1. The covariates X = (X_1, . . . , X_p) are standard normal. The censoring variables are uniformly distributed as U(0, C_0) and independent of the event times, where C_0 is chosen to obtain censoring rates of 25% and 40%. We consider two sample sizes, n = 200 and n = 400. The tuning parameter λ is chosen by cross-validation. For each value of n, we simulated 1000 independent datasets {(y_1, x_1^T), . . . , (y_n, x_n^T)} and calculated an estimate of β from each. For each estimator β̂ we recorded: the model size, Â = {j : β̂_j ≠ 0}; an indicator of whether the true model was selected, I{Â = A}; the false positive rate, |Â − A|/|Â|; the false negative rate, |A − Â|/(p − |Â|); and the model error, (β̂ − β*)^T (β̂ − β*). The columns labeled "size", "rate", "F+", "F-" and "MSE" report these quantities. Results for SELO, LASSO, SCAD and MCP are summarized in the tables. We use V-fold cross-validation to determine the tuning parameter, with CV score

CV(λ) = Σ_{v=1}^V L_n^{(v)}(α̂^{(−v)}, β̂^{(−v)}),

where (α̂^{(−v)}, β̂^{(−v)}) is the estimate computed with the v-th fold removed and L_n^{(v)} is the weighted loss evaluated on the v-th fold. In this article, we set V = 5.
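The data-generating mechanism above can be sketched as follows; the value C0 = 5.0 is a placeholder, since in the paper C_0 is calibrated to hit the target 25% or 40% censoring rate:

```python
import numpy as np

def gen_aft_data(n, beta, sigma=0.1, C0=5.0, seed=0):
    """Simulate right-censored AFT data: T = X beta + eps, C ~ U(0, C0)."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    X = rng.standard_normal((n, p))                 # standard normal covariates
    T = X @ beta + sigma * rng.standard_normal(n)   # log failure times
    C = rng.uniform(0.0, C0, size=n)                # log censoring times
    Y = np.minimum(T, C)                            # observed log times
    delta = (T <= C).astype(int)                    # censoring indicators
    return Y, delta, X

beta = np.array([3.0, 1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
Y, delta, X = gen_aft_data(200, beta)
print("empirical censoring rate:", 1 - delta.mean())
```

In practice one would search over C0 (e.g., by bisection on the empirical censoring rate) to match the target rate before running the 1000 replications.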

Simulation I
This example was conducted with p = 8 and β = (3, 1.5, 0, 0, 1, 0, 0, 0) ∈ R^8. Table 1 summarizes the variable selection results for SELO, LASSO, SCAD and MCP under censoring rates of 25% and 40%. Overall, SELO performs better than the other three methods and selects the correct model more frequently. For instance, when the censoring rate is 25%, SELO selects the true model most accurately: the true model size is 3 and the average size from SELO is 3.43. LASSO performs worse than the other three methods in both model size and correct rate. LASSO, SCAD and MCP select models with average sizes 4.12, 4.05 and 4.04, and select the correct model in 42%, 46.7% and 47% of replications, respectively. Similar patterns appear when the censoring rate is 40%, although the results under 25% censoring are clearly better than those under 40%. The results also improve as n increases: for instance, SELO selects 3.07 variables on average when n = 400, more accurate than the 3.43 obtained when n = 200, and the other indicators show the same trend.

Simulation II
This example was conducted with p = 50 and β = (3, 1.5, 0, 0, 2, 0, 3, 0, 0, 2, 0, . . . , 0) ∈ R^50, with the remaining entries of β equal to zero. Other settings were similar to Simulation I, and the results are listed in Table 2. SELO again performs better than the other three methods. When the censoring rate is 25%, the average model size from SELO is 5.68, the closest to the true model, compared with 9.64, 9.27 and 8.86 for LASSO, SCAD and MCP. SELO also performs better on the correct rate and the other indicators: it selects the correct model in 52% of replications, compared with 3%, 5% and 7% for LASSO, SCAD and MCP, and its "F+" and "F-" values are the smallest among the four methods. The patterns for p = 50 are similar to those for p = 8: the results deteriorate when the censoring rate is 40%, but when n increases from 200 to 400, SELO selects the correct model more often.

Simulation III
This example uses the settings above under 25% and 40% censoring, and we also report the mean estimated variance across the 1000 simulated datasets for n = 200 and n = 400. From Tables 3 and 4, we see that SELO with tuning over τ ∈ {0.001, 0.01, 0.1, 0.5} yields smaller variance than SELO with τ = 0.01 fixed. The 1000 estimates remain stable under both censoring rates, and the results improve when n increases to 400.

PBC Data
The PBC data were collected in the Mayo Clinic trial of primary biliary cirrhosis of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to the Mayo Clinic during that ten-year interval, met the eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the data participated in the randomized trial and have complete data; the additional 112 cases did not participate in the clinical trial. After deleting cases with missing data, the remaining 276 observations are used in the analysis. We consider 17 covariates: age, albumin, alk.phos, ascites, ast, bili, chol, copper, platelet, edema, hepato, protime, sex, spiders, stage, trt and trig.
For the 276 complete cases, we computed the results in Table 5. The optimal value of λ chosen by CV for SELO is small, 0.008, while the optimal values for LASSO, SCAD and MCP are 0.011. LASSO selects 6 variables (sex, hepato, bili, albumin, protime and stage), SCAD selects the same variables as MCP, and SELO selects 3 variables: sex, albumin and protime. The variables selected by SELO are contained in those selected by the other three methods. We also calculated the AIC (Akaike Information Criterion), AIC = n log(σ̂^2) + 2(d + 1), where d is the number of non-zero parameters and σ̂^2 is the estimated residual variance, for the four methods; SELO achieves the smallest AIC. In Table 6, we report p values for the coefficients of the variables selected by SELO, computed as follows. First, we calculated the t statistics t_k = β̂_k/s, where s is the standard error of the estimator. Then we compared |t_k| with the critical value t_{α/2}(n − K), where α = 0.05 and n − K is the degrees of freedom, and computed the corresponding two-sided p values. Table 6 reports the significance test results: the p values of sex, albumin and protime are all less than 0.05, so the variables selected by SELO are all significant. Overall, SELO selected a simpler model than the other three methods.
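The AIC comparison can be reproduced schematically. In this sketch σ̂^2 is treated as a given residual-variance estimate (the paper's definition is terse, so this is one plausible reading), and the function name is ours:

```python
import numpy as np

def aft_aic(n, sigma2_hat, beta_hat):
    """AIC = n * log(sigma_hat^2) + 2 * (d + 1), d = #non-zero coefficients."""
    d = int(np.sum(beta_hat != 0))
    return n * np.log(sigma2_hat) + 2 * (d + 1)

# With sigma2_hat = 1 the first term vanishes and AIC = 2 * (d + 1):
print(aft_aic(10, 1.0, np.array([1.0, 0.0])))  # 4.0

# A sparser model with only slightly worse fit attains the smaller AIC:
dense = np.array([0.2, 0.1, 0.05, 0.3, 0.1, 0.4])
sparse = np.array([0.0, 0.0, 0.0, 0.3, 0.0, 0.4])
print(aft_aic(276, 0.50, dense), aft_aic(276, 0.51, sparse))
```

This is the trade-off the criterion encodes: each retained coefficient must reduce log(σ̂^2) enough to pay a penalty of 2/n.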

Conclusions
Statistical analysis of failure times with high-dimensional covariates is an important topic. In this article, we investigate a new method, SELO, for simultaneous variable selection and estimation in the AFT model with high-dimensional covariates. A real dataset (PBC) is analyzed, and SELO selects several important covariates. Our numerical results indicate that SELO performs better than the other three methods. In this article we address the situation where p grows more slowly than n and prove the oracle property under the condition p/n → 0; both n and p are allowed to diverge, but p goes to infinity more slowly than n. The case where p is much larger than n is left for future research.