Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent

We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of (cid:96) 1 and (cid:96) 2 penalties (elastic net). Our algorithm ﬁts via cyclical coordinate descent, and employs warm starts to ﬁnd a solution along a regularization path. We demonstrate the eﬃcacy of our algorithm on real and simulated data sets, and ﬁnd considerable speedup between our algorithm and competing methods.


Introduction
Consider the usual survival analysis framework.We have data of the form (y 1 , x 1 , δ 1 ), . . ., (y n , x n , δ n ) where y i , the observed time, is a time of failure if δ i is 1 or right-censoring if δ i is 0. As in regression, x i is a vector of potential predictors (x i,1 , x i,2 , . . ., x i,p ).We further let t 1 < t 2 < . . .< t m be the increasing list of unique failure times, and j(i) denote the index of the observation failing at time t i .One potential problem of interest is to study the relationship between predictor variables and survival time.Commonly, the Cox proportional hazards model (Cox 1972) is used to approach this problem.The Cox model assumes a semi-parametric form for the hazard where h i (t) is the hazard for patient i at time t, h 0 (t) is a shared baseline hazard, and β is a fixed, length p vector.Inference is then made via the partial likelihood where R i is the set of indices, j, with y j ≥ t i (those at risk at time t i ).Inference made with the partial likelihood ignores all information between failure times.For ease of notation the above formula assumes that the y i are unique, however it can be suitably altered for the case of ties (see Section 2.5).By maximizing the partial likelihood, one can estimate β.The beauty of this approach is that it allows estimation of β while ignoring h 0 .For classical problems, with many more observations than predictors, the Cox model performs well.However, problems with p > n, lead to degenerate behavior; to maximize the partial likelihood, all of the β i are sent to ±∞.To combat this problem, Tibshirani (1997) proposed the use of an L1 (lasso) penalty in the Cox model.This both provides a well defined solution, and a solution with few nonzero β i .Even in the n > p case, if p is sufficiently close to n, this may better estimate β than the unpenalized Cox model.Gui and Li (2005) developed an algorithm to fit this model using Newton Raphson approximations and the lasso path solution to the penalized least squares problem provided by the adjusted LARS (least angle regression) solution of Efron et al. (2004).
More recently Zou and Hastie (2005) proposed the elastic net for linear regression; to maximize the likelihood subject to the constraint gives the lasso penalty.Park and Hastie (2007a) applied this to the Cox model and proposed an efficient algorithm to solve this problem along a path of c and α values.Their algorithm exploits the near piecewise linearity of the coefficients to approximate the solution at different constraints, then numerically maximizes the likelihood for each constraint via a Newton iteration initialized at the approximation.Goeman (2010a) also attacked this problem.He developed a hybrid algorithm, combining gradient descent and Newton's method.
In this paper we instead employ cyclical coordinate descent.This method has been applied to penalized regression and in particular, elastic net penalties; recently by Friedman et al. (2010), van der Kooij (2007) and Wu and Lange (2008).Friedman et al. (2010) also recognized the strength of employing warm starts to solve the problem along a path of constraint values.
We build on the work of Friedman et al. (2010) and develop a fast algorithm to fit the Cox model with elastic net penalties.The time-ordered structure of the partial likelihood required the development of special risk set updates for the procedure.In addition, we present a method for selecting a well behaved path of λ values.We further include a number of checks on deviance and step size to make efficient use of CPU time.We show via simulation that our algorithm is both efficient and stable for small and large problems.In time trials we show that it is significantly faster than the Hastie-Park algorithm and the Goeman algorithm.Our algorithm scales efficiently allowing us to solve much larger problems than previously possible.We also provide a publicly available R (R Development Core Team 2010) implementation in the package glmnet (Friedman et al. 2011), available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=glmnet.
In Section 2 we introduce our algorithm to fit the Cox model with elastic net penalties.In Section 3 we look at the stability of our algorithm and run time trials on simulated data and one real microarray example.

Algorithm
We again consider the survival framework of Section 1.For the time being we assume no ties in failure/censoring time.We wish to find β which maximizes subject to our constraint: Maximizing the partial likelihood is equivalent to maximizing a scaled log partial likelihood, We scale by a factor of 2/n for convenience.Hence, if we consider the Lagrangian formulation, our problem becomes where, β 2 i is known as the elastic net penalty.It is a mixture of the 1 (lasso) and 2 (ridge regression) penalties.The lasso penalty (Tibshirani 1996) tends to choose only a few nonzero coefficients.
While often desirable, this can cause problems.If two predictors are very correlated, the lasso will pick one and entirely ignore the other.
On the other hand, ridge regression scales all the coefficients towards 0, but sets none to exactly zero.This helps to regularize in problems with p > n, but does not give a sparse solution.However, ridge regression better handles correlated predictors.If two predictors are very correlated, ridge regression will tend to give them equal weight.
The elastic net combines the strengths of the two approaches.For fixed λ, as α changes from 0 to 1 our solutions move from more ridge-like to more lasso-like, increasing sparsity but also increasing the magnitude of all non-zero coefficients.With α = 0.95 (or even closer to 1), the elastic net behaves very similarly to the lasso, only removing degenerate behavior due to extreme correlations.

Basic algorithm
Our strategy for maximizing Equation 1 is very similar to the standard Newton Raphson algorithm.However, at each iteration instead of solving a general least squares problem, we solve a penalized reweighted least squares problem.
Let X denote the design matrix, β the coefficient vector, and η = Xβ.Let ˙ (β), ¨ (β), (η), (η) denote the gradient and Hessian of the log-partial likelihood with respect to β and η respectively.A two term Taylor series expansion of the log-partial likelihood centered at β has the form where η = X β.Simple algebra gives us and C(η, β) does not depend on β.
One difficulty arises in the computation of (η).Because this is a full matrix it would require computation of O(n 2 ) entries.In order to speed up the algorithm, we instead replace (η) by a diagonal matrix with the diagonal entries of (η).We denote the ith diagonal entry of (η) by w(η) i .We rely on the argument of Hastie and Tibshirani (1990, Chapter 8) that this substitution works because the optimal β will remain a fixed point of the algorithm, and the off diagonal entries of (η) are small in comparison to the diagonal.
The minimization in step 3 is done by cyclical coordinate descent, which will be described in Section 2.2.

Penalized least squares
We have reduced our problem to repeatedly solving the penalized, weighted least squares problem (2) Let M (β) denote the objective function in Equation 2. Now consider a coordinate descent step for minimizing M (β).Suppose we have estimates for β l for all l = k and would like to minimize our objective in β k .We compute the derivative and −1 if β k < 0 and, for the sake of completeness, 0 if β k = 0. From here, a simple calculation (Friedman et al. 2007) shows that the coordinate solution is given by and C k is the set of i with t i < y k (the times for which observation k is still at risk).
Thus, we solve for β k , combining our usual least squares coordinate wise solution with proportional shrinkage from the 2 penalty, and soft thresholding from the 1 .Applying Equation 4to the coordinates of β in a cyclic fashion until convergence minimizes Objective (2).This algorithm was proposed by van der Kooij (2007) for usual linear regression, and applied to logistic and multinomial generalized linear models by Friedman et al. (2010).

Pathwise solution
We will usually be interested in models for more than one value of λ.Toward this end, for fixed α, we compute the solutions for a path of λ values.We begin with λ sufficiently large to set β = 0, and decrease λ until we arrive near the unregularized solution.By employing warm starts this procedure is efficient and increases the stability of our algorithm.
In Equation 4 notice that if 1 n n i=1 w i (0)x ij z(0) i < αλ for all j, then β = 0 minimizes the objective M .Thus, we set our first λ to be We do not solve all the way to the unregularized solution.When p > n the unregularized solution is undefined ( β shoots off to ∞).Even near λ = 0 the solution is poorly behaved.
We have examples where the last few λ values account for more than 99% of the algorithm's runtime.We argue that it is acceptable to ignore solutions with λ near 0, as any reasonable model selection criteria (commonly cross-validation) will choose a much more regularized model.We set λ min = λ max , and compute solutions over a grid of m values between λ min and λ max , where λ j = λ max (λ min /λ max ) j/m for j = 0, . . ., m.In our implementation, the default value for m is 100.The default value for depends on whether or not n ≥ p; for n < p, we default to = 0.05, for n ≥ p, = 0.0001 Newton step algorithms can be unstable and have no convergence guarantee without step size optimization.This is only a problem if the inital parameter estimate is far from the optimal value.Because we have an exact starting solution at the beginning of our path and employ warm starts at each new λ, our initial estimate and solution at each λ are never far apart.Hence, algorithm is well behaved.For computational speed, we have opted against employing divergence checks in our implementation.Thus far, we have not come across any divergence problems.then we need only calculate the full sum, j∈R i e ηk , for the first index, i, in C k .For each subsequent i, we subtract off the contribution from observations with failure/censoring times between time i−1 and time i.Thus, the calculation is reduced to O(n).These updates explain, in part, the incredible increase in efficiency our algorithm sees over competing algorithms in section 3 for large n.

Weights and ties
In the above algorithm, we have assumed that every failure/censoring time was unique and gave equal weight to each observation in the partial likelihood.It is straightforward to include ties and assign different weights to observations.We use the Breslow approximation of the partial likelihood for ties (Breslow 1972) and the corresponding weight extension (integer weights correspond to repeated measurements).The partial likelihood becomes where D i is the set of indices of failure at time t i , ω i is the weight associated with observation i and d i = j∈D i ω j is the sum of weights at time t i .This changes w k and z k The rest of the algorithm remains the same.The addition of weights and ties does not seriously affect the computational cost.
Note, scaling the weights similarly scales the log-partial likelihood.In our implementation, to provide comparability between partial likelihoods with different weight vectors, we standardize the weights to sum to 1. Note, this is why we scale the log likelihood by 1/n in the "unweighted problem".

Deviance cutoff
In addition to the usual stopping criteria, we would like to terminate the algorithm early if, at some point, our model explains almost all of the variability in the observations.To this end, we define the deviance of a model with parameter β to be where saturated is the maximum log-partial likelihood when the η i may vary freely (whereas usually η i is constrained by η i = x i β) and (β) is the log-partial likelihood under β.We similarly define the null deviance to be where null = (0).Note, for all in Equations 9 and 10 we use the weighted log-partial likelihood from Section 2.5 (in the unweighted case we use weights 1/n for each observation).This definition of deviance parallels deviance defined for likelihood-based models.At a given step we terminate the algorithm early if more than 99% of the null deviance is explained by the model.That is, if A simple calculation shows that In the "unweighted" case (or actually equally weighted, 1/n case) this reduces to saturated = 0 and null = − m i=1 log |R i |.This stopping criterion is another safeguard against wasting time computing solutions for values of λ too small to sufficiently regularize the problem (in the case of n > p this cutoff will effectively never kick in).

Comparison
A number of other algorithms have recently been developed for the same problem.Gui and Li (2005) and Park and Hastie (2007a) both developed LARS-like algorithms, which combine approximate paths with Newton steps; though they differ in that Gui and Li (2005) use a LARS path inside each Newton approximation while Park and Hastie (2007a) use an approximate LARS path to initialize each step, then solve exactly via Newton Raphson.Goeman (2010a) instead combines gradient descent with Newton's method -using gradientlike steps until close to the solution, then quickly converging with Newton steps.
Our algorithm also uses a Newton Raphson-like method; however unlike other methods, by employing coordinatewise updates we take advantage of the sparsity of each solution.Rather than solving for all coordinates, we iterate over an active set.Because the number of nonzero coordinates is never greater than the number of observations, in the n << p situation we gain significant efficiency within each Newton iteration.Furthermore, efficient computation of risk sets and updating formulas also gains us efficiency in calculating weights and pseudo-responses for each Newton step.

Timings
In this section we compare the runtime of our algorithm, coxnet, with two competing algorithms, coxpath and penalized.Because coxnet, coxpath and penalized all take different elastic net penalty paths (coxpath and penalized fix λ 2 ) all comparisons were run using only the lasso penalty (α = 1).For our algorithm, we also consider the relationship between runtime and both number of observations and covariates.All calculations were carried out on an Intel Xeon 3 GHz processor.
We generated standard Gaussian predictor data, X, n observations on p predictors.We fixed pairwise correlation between any 2 predictors X i and X j at ρ, for several values of ρ.We then generated "true" survival times according to where β j = (−1) j exp (−2(j − 1)/20), Z ∼ N (0, 1), and k is chosen so that the signal-to-noise ratio is 3.0.Similarly, we generated censoring times by The recorded survival time was set to be the minimum of the "true" survival and censoring times, T = min{Y, C}.The observation was said to be censored if C < Y , the censoring time preceeded the "true" survival time.
Table 1 shows runtime comparisons between our coordinate descent algorithm, coxnet, the combination gradient descent-Newton Raphson method, penalized (Goeman 2010a) from the package penalized (Goeman 2010b), and the LARS-like algorithm, coxpath (Park and Hastie 2007a) from the package glmpath (Park and Hastie 2007b).All of these algorithms were implemented in the R language.coxnet does all computation in Fortran, while coxpath does some computation in R, but frequently calls Fortran routines for numerical optimization.penalized does all computation in R. To make the algorithms more comparable, we made coxnet solve for the path of λ chosen by coxpath with λ min = 0.05λ max fixed for both.Unfortunately penalized does not allow for a user indicated λ path.Instead we used the same λ min as the other algorithms and let penalized choose a path with number of λ values equal to that of the coxpath path.In Table 1 we see that coxnet is significantly faster than coxpath and penalized.In particular, it handles large n more efficiently than both and large p much more efficiently than penalized (though somewhat more efficiently than coxpath as well).We also note a slight curiousity: coxnet runs faster for more highly correlated predictors.This is counter-intuitive and the opposite is seen in Friedman et al. (2010) for cyclical coordinate descent applied to standard linear and logistic regression.

Scaling in n and p
In addition to comparing our algorithm to coxpath, we would like to know how it scales in n and p.We simulated data as before, with ρ = 0.5 fixed, and a variety of n and p.For each n, p pair we solved for a path of 100 λ values with λ min = 0.05λ max .Figure 1 shows runtimes for fixed p as n changes, and for fixed n as p changes.From these plots we can see that the runtime is relatively linear in n and p.

Cross validation
Once we have calculated a path of solutions it is necessary to select an optimal λ.Model selection is often done by k-fold cross validation -splitting the data in k pieces, using k − 1  We use a technique proposed in van Houwelingen et al. (2006).We split our data into k parts.Our goodness of fit estimate for a given part i and λ is where −i is the log-partial likelihood excluding part i of the data, and β −i (λ) is the optimal β for the non-left out data, found from maximizing −i + λ||β|| 1 .Our total goodness of fit estimate, ĈV(λ), is the sum of all Ĉ V i (λ).We choose the λ value which maximizes ĈV(λ).By using (11) -subtracting the log-partial likelihood evaluated on the non-left out data from that evaluated on the full data -we can make efficient use of the death times of the left out data in relation to the death times of all the data.

Real data
We also compare timings on a real data set, Alizadeh et al. (2000): gene-expression data in lymphoma patients.There were 240 patients with measurements on 7399 genes.We used the lasso penalty (α = 1), and the path of λ values was chosen as in the simulations.We ran 10-fold cross validation over these paths to find the optimal value of λ. 10-fold cross validation run times were 574.5 seconds for coxnet, 18998.7 seconds for penalized, and 4796.3 seconds for coxpath.Without cross validation, times were 54.9 seconds for coxnet, 2524.9 seconds for penalized and 679.5 seconds for coxpath.As in the simulation results, our method shows a substantial run time improvement.

Code example
Our implementation is straightforward to run.We demonstrate coxnet on the data from Section 3.3.We first load the data and set up the response.

Discussion
We have shown cyclical coordinate descent to be a very efficient algorithm for maximizing the partial likelihood with the elastic net penalty.Each coordinate step has a simple, closed form solution.By employing warm starts we have increased the stability of the algorithm and found solutions for a path of penalty parameters.In simulations and an actual dataset we have shown our algorithm to be significantly faster than the competition.A new version of glmnet (Friedman et al. 2011) incorporating coxnet is available on CRAN.
One obvious bottleneck in our algorithm is the computation of w k and z k in Equations 6 and (7).For each i in C k we need to calculate j∈R i e ηk , the sum of hazards of those still at risk.Because both C k and R i have O(n) elements, this is naively an O(n 2 ) calculation.However, if we note that

Table 1 :
Contains timings (secs) for coxnet and coxpath with lasso penalty; total time for λ path averaged over 3 trials.of those to build the model and validating on the kth (often via the predictive likelihood); cycling through this procedure, validating on each of the k pieces in turn, and then averaging or summing the k different deviances.Cross validation in coxnet works in a similar fashion, but with a few subtle differences.Consider the extreme case k = n, or leave one out cross validation.The partial likelihood on the left out sample is either ill-defined (if the left out sample was right censored) or identically 1 for all β.In this case cross validation tells us nothing about the optimality (or lack thereof) of our model.Because the partial likelihood is not as nicely separable as the Gaussian log likelihood (or any exponential family) naively