Coordinate descent methods for the penalized semiparametric additive hazards model

For survival data with a large number of explanatory variables, lasso penalized Cox regression is a popular regularization strategy. However, a penalized Cox model may not always provide the best fit to data and can be difficult to estimate in high dimensions because of its intrinsic nonlinearity. The semiparametric additive hazards model is a flexible alternative which is a natural survival analogue of the standard linear regression model. Building on this analogy, we develop a cyclic coordinate descent algorithm for fitting the lasso and elastic net penalized additive hazards model. The algorithm requires no nonlinear optimization steps and offers excellent performance and stability. An implementation is available in the R-package ahaz, and we demonstrate this package in a small timing study and in an application to real data.


Introduction
With the increasing interest in high-throughput biomarker research, there is a growing need for simple and efficient statistical methods for relating a survival time endpoint to a large number of explanatory variables. Variable selection methods such as the lasso (Tibshirani, 1997) or SCAD (Fan and Li, 2001) offer convenient means of imposing additional regularity via penalization, such that well-known regression models can be straightforwardly adapted to high-dimensional data. By now, many standard survival regression models have been subjected to various penalization strategies (Li, 2008), yet the Cox proportional hazards model continues to serve as a reference model and the main target of theoretical, algorithmic, and applied research on penalized survival regression. Although the Cox model is both flexible and simple to interpret, alternative modeling strategies deserve a wider appreciation for a number of reasons. For example, Ma et al. (2010) recently pointed out a fact which is well known from the lower-dimensional setting: that a Cox model may not always provide a satisfactory fit to a high-dimensional data set. Moreover, with the increasingly high-dimensional data available today, the intrinsically nonlinear Cox model is a peculiar reference model in terms of the computational efficiency and stability of fitting procedures. A range of algorithms have been developed for fitting penalized Cox models (Gui and Li, 2005; Park and Hastie, 2007; Sohn et al., 2009; Goeman, 2010, and others), but their computational performance is limited by the use of costly Newton-Raphson iterations or similar schemes to deal with the penalized partial likelihood.
Only recently did Simon et al. (2011) describe an impressively fast algorithm for fitting the penalized Cox model which combines iteratively reweighted least squares with cyclic coordinate descent. Cyclic coordinate descent optimizes a convex loss function by solving all coordinatewise optimization problems in an iterative manner. While not a new technique in the context of penalized regression (see the references in Friedman et al. (2010)), cyclic coordinate descent has recently been rediscovered for its ability to efficiently handle even very high-dimensional problems when carefully implemented. For generalized linear models and the Cox model, software for performing coordinate descent-based penalized estimation is available in the R-package glmnet.
In this paper, we develop a cyclic coordinate descent algorithm for the elastic net penalized variant of a flexible but less well-known alternative to the Cox model, the so-called semiparametric additive hazards model (Lin and Ying, 1994; McKeague and Sasieni, 1994). This model asserts a hazard function given by the sum of some baseline hazard function and a regression function of the explanatory variables. It is a survival analogue of the standard linear regression model and leads to natural estimating equations which are surprisingly similar to the normal equations. The flexibility and computational parsimony of the additive hazards model makes it a useful tool on which to base regularization methods for high-dimensional survival data (Ma et al., 2006; Ma and Leng, 2007; Martinussen and Scheike, 2009, 2010). We describe how computational tricks for the penalized linear regression model can be adapted to obtain a very efficient and stable coordinate descent method for fitting the elastic net penalized additive hazards model. In contrast to coordinate descent methods for the penalized Cox model, convergence is theoretically guaranteed for our algorithm. The algorithm has been implemented in C to interface with the R-package ahaz (Gorst-Rasmussen, 2011), and we provide examples of its usage and performance on simulated and real data.

The semiparametric additive hazards model
Suppose that we observe (T_1, δ_1, Z_1), …, (T_n, δ_n, Z_n) where T_i is a (right-censored) survival time, δ_i is the indicator which is 1 if subject i experiences an event at time T_i and 0 otherwise, and Z_i ∈ R^p is a vector of explanatory variables. To simplify notation, we will describe each pair (T_i, δ_i) via the counting process N_i(t) := I(T_i ≤ t)δ_i and the at-risk process Y_i(t) := I(t ≤ T_i), where I denotes the indicator function. The counting process integral ∫_0^t f(s) dN_i(s) is then simply a notationally convenient way of writing f(T_i)I(T_i ≤ t)δ_i.
The semiparametric additive hazards model (Lin and Ying, 1994; McKeague and Sasieni, 1994) asserts a conditional hazard function of the form

    λ(t | Z_i) = λ_0(t) + β_0^⊤ Z_i,    (1)

with λ_0 some unspecified baseline hazard constituting the nonparametric part of the model. Lin and Ying (1994) proposed to perform estimation in this model via estimating equations which mimic the score equations for the Cox model. Specifically, they proposed to estimate β_0 as the root of the pseudo-score function

    U(β) := Σ_{i=1}^n ∫_0^∞ Z_i {dN_i(t) − Y_i(t) dΛ̂_0(t) − Y_i(t) β^⊤ Z_i dt},

where Λ̂_0 is a Breslow-type estimator of the cumulative baseline hazard ∫_0^t λ_0(s) ds.

Solving U(β) = 0 is equivalent to solving the p × p linear system of equations

    D β = d,    (2)

where

    D := n^{−1} Σ_{i=1}^n ∫_0^∞ Y_i(t) {Z_i − Z̄(t)}^{⊗2} dt,    d := n^{−1} Σ_{i=1}^n ∫_0^∞ {Z_i − Z̄(t)} dN_i(t),    (3)

and

    Z̄(t) := Σ_{i=1}^n Y_i(t) Z_i / Σ_{i=1}^n Y_i(t)    (4)

denotes the at-risk average of the Z_i s. The estimator obtained from (2) can be shown to be root-n consistent by martingale arguments (Lin and Ying, 1994).
The estimating equation (2) is attractive for several reasons. Not only does it provide an explicitly calculable estimator in a flexible semiparametric model; it is also analytically very similar to the normal equations (X^⊤X)β = X^⊤y for the classical linear regression model y = Xβ_0 + ε. In fact, defining 'responses' y_i = dN_i(t) and 'explanatory variables' X_i = (Z_i − Z̄(t))Y_i(t), it is seen that (2) is simply a time-averaged version of the normal equations. The similarity between (2) and the normal equations was exploited by Ma and Leng (2007) and Martinussen and Scheike (2009) to construct a lasso penalized estimator for the additive hazards model. They noted that solving (2) is equivalent to minimizing the loss function

    L(β) := β^⊤ D β − 2 d^⊤ β,    (5)

leading to a lasso penalized variant with a loss function of the form

    L_pen(β; λ) := L(β) + 2λ‖β‖_1.    (6)

Here ‖·‖_1 is the ℓ_1-norm while λ ≥ 0 is a parameter controlling the degree of regularization. Because of geometric properties of the ℓ_1-norm, the lasso penalized estimator argmin_β L_pen(β; λ) performs shrinkage and variable selection simultaneously (Tibshirani, 1997). For large values of λ, most lasso regression coefficients will be exactly zero; as λ grows smaller, the lasso regression coefficients become increasingly similar to their unpenalized counterparts. Ma and Leng (2007) and Martinussen and Scheike (2009) proposed to use the lasso-LARS algorithm (Efron et al., 2004) to calculate the lasso penalized estimator argmin_β L_pen(β; λ). The lasso-LARS algorithm for the standard linear regression model is easily adapted to work with the additive hazards model by supplying pre-computed versions of D (in place of the covariance matrix) and d (in place of the covariate-response inner products). However, pre-computation of D may be infeasible for large p. Even without pre-computation, the computational cost of lasso-LARS is similar to that of solving the unpenalized regression problem, which is substantial for large p.
In the following, we propose cyclic coordinate descent as a much more efficient alternative.

Model fitting via cyclic coordinate descent
Since the extension is straightforward, we will work with a variant of (6) which includes an ℓ_2 penalty term. That is, we consider the problem of minimizing the following penalized loss:

    L_pen(β; λ, α) := L(β) + 2λ{α‖β‖_1 + ½(1 − α)‖β‖_2²},  0 ≤ α ≤ 1.    (7)

We denote henceforth

    β̂(λ) := argmin_β L_pen(β; λ, α).    (8)
The ℓ_1/ℓ_2 penalization in (7) is known as elastic net penalization (Zou and Hastie, 2005). When α = 1, the loss function reduces to the lasso penalized loss (6). If α < 1, the loss function favors joint selection of highly correlated variables. This follows by arguments similar to those in Zou and Hastie (2005), utilizing the heuristic interpretation of D as a time-averaged covariance matrix. We have omitted the dependence on α in the left-hand side of (8) for notational simplicity.
Cyclic coordinate descent is a numerical optimization technique which approximates the minimum of a function f: R^p → R by iteratively, for k = 0, 1, 2, …, cycling through the p coordinatewise optimization problems

    x_j^{(k+1)} := argmin_{x∈R} f(x_1^{(k+1)}, …, x_{j−1}^{(k+1)}, x, x_{j+1}^{(k)}, …, x_p^{(k)}),  j = 1, …, p,    (9)

fixing, for the update of the jth coordinate, all other coordinates at their most recent value. For a convex f satisfying certain separability conditions, the iterates x^{(k)} converge to argmin_{x∈R^p} f(x), irrespective of x^{(0)} (Tseng, 1988). In particular, it suffices that f is the sum of a convex, continuously differentiable function and a separable convex penalty such as the elastic net penalty.
To use cyclic coordinate descent to calculate (8), simply observe that

    ∂L_pen(β; λ, α)/∂β_j = 2 D_jj β_j + 2 Σ_{k≠j} D_jk β_k − 2 d_j + 2λ(1 − α)β_j + 2λα ∂|β_j|.

It follows that the updating rule (9), for a given value of (λ, α), takes on the form

    β_j ← S(d_j − Σ_{k≠j} D_jk β_k, λα) / (D_jj + λ(1 − α)),

where S denotes the soft-thresholding operator

    S(x, t) := sgn(x)(|x| − t)_+.

While convexity ensures theoretically that β^{(k)} converges to β̂(λ), convergence can be very slow if β^{(0)} is poorly chosen. Fundamental to ensuring rapid convergence and stability of coordinate descent are the following two structural properties of the elastic net problem:

1. If λ is sufficiently large then β̂(λ) = 0 (sparsity).
2. The solution path λ ↦ β̂(λ) is continuous, so that β̂(λ') is close to β̂(λ) whenever λ' is close to λ (continuity).
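To make the update concrete, here is a minimal Python sketch of cyclic coordinate descent for the quadratic loss with elastic net penalty. This is an illustrative translation only; the ahaz implementation is in C and, as described below, operates on active sets rather than cycling over all p coordinates.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator S(x, t) = sgn(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def coordinate_descent(D, d, lam, alpha, beta0=None, tol=1e-5, max_iter=1000):
    """Cyclic coordinate descent for L(beta) = beta' D beta - 2 d' beta
    plus the elastic net penalty 2*lam*(alpha*||beta||_1 + 0.5*(1-alpha)*||beta||_2^2).
    Illustrative sketch; convergence is checked on the coefficient changes here."""
    p = len(d)
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # d_j minus the contribution of all other coordinates
            r_j = d[j] - D[j] @ beta + D[j, j] * beta[j]
            new = soft_threshold(r_j, lam * alpha) / (D[j, j] + lam * (1 - alpha))
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta
```

With lam = 0 the update simply solves the linear system D beta = d by Gauss-Seidel-type sweeps, which provides a convenient correctness check.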
Hence, β̂(λ̃) for some λ̃ can be calculated efficiently and stably via coordinate descent by calculating a pointwise regularization path β̂(λ_max), …, β̂(λ̃) at a grid of closely spaced λ-values; starting out with some large λ_max so that β̂(λ_max) = 0 and using the most recent solution β̂(λ_{l−1}) as the initial value in the coordinate descent algorithm for β̂(λ_l). This idea of using 'warm starts' was discussed in more detail by Friedman et al. (2007) and Friedman et al. (2010). For the penalized loss (7), it is easily seen that we obtain β̂(λ_max) ≡ 0 by taking

    λ_max := α^{−1} max_{1≤j≤p} |d_j|.

Following Simon et al. (2011), we consider an exponentially decreasing sequence of regularization parameters of length m from λ_max to some λ_min < λ_max such that

    λ_l := λ_max (λ_min/λ_max)^{l/(m−1)},  l = 0, 1, …, m − 1.    (10)

If we denote ε := λ_min/λ_max, a reasonable, although arbitrary, choice is to take m = 100 and ε = 0.0001 if p < n and ε = 0.05 if p ≥ n.
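The grid construction can be sketched as follows. This is illustrative Python; `lambda_grid` and its argument names are our own, not part of ahaz.

```python
import numpy as np

def lambda_grid(d, alpha, m=100, eps=None, n=None, p=None):
    """Exponentially decreasing sequence of m regularization parameters from
    lambda_max = max_j |d_j| / alpha down to eps * lambda_max (illustrative sketch)."""
    lam_max = np.max(np.abs(d)) / alpha
    if eps is None:
        # default ratio eps = lambda_min / lambda_max, as suggested in the text
        eps = 0.05 if (n is not None and p is not None and p >= n) else 1e-4
    return lam_max * eps ** (np.arange(m) / (m - 1))
```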
Naively, one would run the coordinate descent algorithm over all p coordinates (i.e., using all p variables) to obtain β̂(λ_0), …, β̂(λ_{m−1}). This is clearly undesirable for large p since it requires calculation of the entire matrix D. However, given β̂(λ) for some λ, the Karush-Kuhn-Tucker (KKT) conditions for the constrained optimization problem (7) imply that β̂_j(λ) = 0 if and only if

    |d_j − D_{j•} β̂(λ)| ≤ λα.    (11)

This leads to the active set strategy (Friedman et al., 2007): we maintain at all times a set A of 'active variables' which are included in the coordinate descent algorithm, starting out with A := ∅ at λ_max. For each λ in the grid:

1. Run the coordinate descent updates

    β_j ← S(d_j − Σ_{k≠j} D_jk β_k, λα) / (D_jj + λ(1 − α)),  j ∈ A,    (12)

until convergence among the variables in A.

2. Check (11) for each variable in {1, …, p}\A. If there are no violations, we have the final solution. If there are violations, we add the violators to A and go back to step 1.

With this approach, it is seen from (11) that we need only calculate rows D_{j•} for j ∈ A.
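The KKT check in step 2 can be sketched as follows (illustrative Python; in the actual algorithm only the rows of D corresponding to active variables are ever formed):

```python
import numpy as np

def kkt_violations(D, d, beta, lam, alpha, active):
    """Return the coordinates outside the active set that violate the
    KKT condition |d_j - D_j. beta| <= lam * alpha (illustrative sketch)."""
    grad = d - D @ beta  # in practice only the required rows of D are computed
    inactive = np.setdiff1d(np.arange(len(d)), active)
    return inactive[np.abs(grad[inactive]) > lam * alpha + 1e-12]
```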
Various stopping criteria can be used in step 1, based either on the relative change in the individual coefficient estimates or on the relative change in the penalized loss function. We prefer the latter since the loss function is less susceptible to instabilities when many variables are included or when near a saturated fit. Specifically, we declare convergence when the relative change in L_pen{β^{(k)}(λ)} from one value of k to the next is less than 10^{−5}.
Note that for α = 1, at most n − 1 variables can be included in the model, by the nature of the lasso penalized problem. In most cases, the user will specify some maximum number of variables to include which is strictly less than n.

Efficient calculation of D
The calculation of rows in the matrix D is the primary bottleneck of our basic coordinate descent algorithm for p large. In contrast to the partial likelihood in the Cox model, which essentially only depends on data at failure times, calculation of D uses data at both censoring and failure times.
Fortunately, it turns out that (3) can still be evaluated rather efficiently.
Suppose that survival times are ordered such that T_1 > T_2 > ⋯ > T_n, assuming no ties. Denote Δ_k := T_k − T_{k+1} (taking T_{n+1} := 0) and assume that variables are centered so that Σ_{i=1}^n Z_i = 0. At time t ∈ (T_{k+1}, T_k], the at-risk set consists of subjects 1, …, k, so that

    D_jl = n^{−1} Σ_{k=1}^n Δ_k Σ_{i=1}^k (Z_ij − Z̄_kj)(Z_il − Z̄_kl),

where we have defined the running at-risk averages Z̄_kj := k^{−1} Σ_{i=1}^k Z_ij. By applying the summation by parts formula, we obtain

    D_jl = n^{−1} { Σ_{i=1}^n T_i Z_ij Z_il − Σ_{k=1}^n k Δ_k Z̄_kj Z̄_kl }.

Hence, if we pre-calculate and store Z̄_{1j}, …, Z̄_{nj} for each variable j, the subsequent calculation of each matrix element D_jl can be accomplished at the modest cost of roughly 2n arithmetic operations.
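The summation-by-parts identity is easy to verify numerically. The following illustrative Python sketch (function names are ours) compares the O(n) formula for a single element against a direct evaluation of the time integral, assuming decreasingly sorted, tie-free survival times:

```python
import numpy as np

def D_element(T, Z, j, l):
    """Element (j, l) of D via the summation-by-parts identity
    D_jl = n^{-1} ( sum_i T_i Z_ij Z_il - sum_k k * Delta_k * Zbar_kj * Zbar_kl ),
    with T sorted decreasingly (illustrative sketch)."""
    n = len(T)
    delta = T - np.append(T[1:], 0.0)   # Delta_k = T_k - T_{k+1}, T_{n+1} := 0
    k = np.arange(1, n + 1)
    Zbar_j = np.cumsum(Z[:, j]) / k     # running at-risk averages
    Zbar_l = np.cumsum(Z[:, l]) / k
    return (np.sum(T * Z[:, j] * Z[:, l]) - np.sum(k * delta * Zbar_j * Zbar_l)) / n

def D_element_direct(T, Z, j, l):
    """Direct evaluation: n^{-1} sum_k Delta_k sum_{i<=k} (Z_ij - Zbar_kj)(Z_il - Zbar_kl)."""
    n = len(T)
    delta = T - np.append(T[1:], 0.0)
    total = 0.0
    for kk in range(1, n + 1):
        Zc = Z[:kk] - Z[:kk].mean(axis=0)   # center within the current at-risk set
        total += delta[kk - 1] * np.sum(Zc[:, j] * Zc[:, l])
    return total / n
```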

Increasing efficiency via improved KKT checks
While our basic coordinate descent algorithm is already quite efficient, there is room for improvement. Denote by p̃ the size of the active set A at λ_min. In retrospect, we need only p̃² entries in the matrix D to construct a regularization path; the remaining (p − p̃)·p̃ entries are used only for the KKT checks (11). A substantially more efficient KKT check can be devised by noting from (3) that

    D_{j•} β̂(λ) = Σ_{i=1}^n Z_ij r_i(λ),    (13)

where

    r_i(λ) := n^{−1} { T_i R_i^λ − Σ_{k=i}^n Δ_k R̄_k^λ },  R̄_k^λ := k^{−1} Σ_{m=1}^k R_m^λ,    (14)

taking R_i^λ := Z_i^⊤ β̂(λ) to be the linear risk score of the ith subject. Formulas as in Section 3.1 can be used for evaluating (14). Substituting (13) in (11), it follows that we can perform the necessary KKT checks by calculating the vector r(λ) and subsequently evaluating p − |A| inner products between n-vectors. Whenever a new variable j enters the model, symmetry of the matrix D implies that we need only calculate D_ij for i ∈ A to be able to run the coordinate descent updates (12). This is a substantial improvement over the basic coordinate descent algorithm in which the entire row D_{j•} must be calculated for each new variable j.
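The representation (13)-(14) can be checked numerically against a direct evaluation of D. The following is an illustrative Python sketch (function names are ours), again assuming decreasingly sorted, tie-free survival times:

```python
import numpy as np

def Dbeta_via_risk_scores(T, Z, beta):
    """Evaluate D @ beta via the risk-score representation:
    (D beta)_j = sum_i Z_ij * r_i with
    r_i = n^{-1} ( T_i R_i - sum_{k>=i} Delta_k * Rbar_k ),
    where R = Z beta and Rbar_k is the running average of R (illustrative sketch)."""
    n = len(T)
    delta = T - np.append(T[1:], 0.0)
    R = Z @ beta
    Rbar = np.cumsum(R) / np.arange(1, n + 1)      # running at-risk averages of R
    tail = np.cumsum((delta * Rbar)[::-1])[::-1]   # sum_{k>=i} Delta_k * Rbar_k
    r = (T * R - tail) / n
    return Z.T @ r

def D_matrix(T, Z):
    """Direct evaluation: D = n^{-1} sum_k Delta_k * (at-risk-centered Z)' (at-risk-centered Z)."""
    n = len(T)
    delta = T - np.append(T[1:], 0.0)
    D = np.zeros((Z.shape[1], Z.shape[1]))
    for k in range(1, n + 1):
        Zc = Z[:k] - Z[:k].mean(axis=0)
        D += delta[k - 1] * Zc.T @ Zc
    return D / n
```

Note that only the n-vector r(λ) and one inner product per checked variable are required, rather than whole rows of D.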
An issue not addressed by this improved strategy is that KKT checks often fail. In fact, they fail at least whenever a new variable enters the model and, in practice, much more frequently. A failed check leads to a restart of the coordinate descent loop. Although another run of coordinate descent is rarely very expensive when using warm starts, calculating p − |A| inner products between n-vectors for the next KKT check is costly. The cost could be reduced if we could first run the coordinate descent/check/restart procedure on a set of variables which is larger than the active set but still smaller than p, and outside which KKT violations are rare. Tibshirani et al. (2010) recently showed how to construct such a set. Adapting their formulas to the present problem, given some γ > λ, they proposed the following sequential strong condition:

    |d_j − D_{j•} β̂(γ)| < α(2λ − γ),    (15)

and argued that if a variable j satisfies this condition then typically β̂_j(λ) = 0. The sequential strong condition is not failsafe and (15) may hold true even though β̂_j(λ) ≠ 0. The point is that this happens rarely. Consequently, by introducing the strong set

    S := {j : |d_j − D_{j•} β̂(γ)| ≥ α(2λ − γ)} ∪ A,

we may further improve efficiency of coordinate descent via the following strategy for each λ:

1. Run coordinate descent for variables in A until convergence.
2. Check for violations of KKT conditions among variables in S only, using (13). If violations occur, add violators to A and go back to step 1, using the current solution as a warm start. Otherwise proceed to step 3.
3. Check for violation of KKT conditions among variables in {1, . . . , p}\S using (13). If violations occur, add violators to A, update S, and go to step 1, using the current solution as a warm start. Otherwise we have obtained the solution for this value of λ.
This strategy is an improvement since we tend to restart the algorithm fewer times in step 3. Accordingly, fewer inner products must be calculated. Other approximate discarding rules than (15) could be used instead since we always conclude by running a fail-safe check of KKT conditions among all variables.
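The construction of the strong set can be sketched as follows (illustrative Python with our own naming; here `beta_prev` plays the role of β̂(γ) for the previous, larger regularization parameter γ):

```python
import numpy as np

def strong_set(D, d, beta_prev, lam, gamma, alpha, active):
    """Sequential strong set: discard variable j when
    |d_j - D_j. beta_prev| < alpha * (2*lam - gamma), gamma > lam;
    always retain the active set (illustrative adaptation of Tibshirani et al. (2010))."""
    grad = np.abs(d - D @ beta_prev)
    keep = grad >= alpha * (2 * lam - gamma)
    return sorted(set(np.flatnonzero(keep)) | set(active))
```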

Implementation in ahaz
The optimized version of the algorithm described in this section has been implemented in C to interface with the R-package ahaz (Gorst-Rasmussen, 2011) via the wrapper function ahazpen.
Since all calculations are done in C, the code can easily be adapted to work with other front-ends than R.
The bottleneck of the algorithm is calculating the roughly p inner products between n-vectors. This can account for 50%-90% of the computation time. As also noted by Tibshirani et al. (2010), simultaneous inner product evaluations are embarrassingly parallel, suggesting good scalability of the algorithm. We have implemented the inner product evaluations via level 2 calls to the BLAS libraries linked to R, thus enabling the user to improve speed of ahazpen further by linking R against high-performance BLAS libraries such as GotoBLAS or ATLAS.

Additional details
The implementation of cyclic coordinate descent provided in ahazpen supports a similar set of options as glmnet, including observation weighting and differential penalization. Specifically, for nonnegative weights w_1, …, w_p, ahazpen can accommodate a penalized loss function of the form

    L_pen(β; λ, α) := L(β) + 2λ Σ_{j=1}^p w_j {α|β_j| + ½(1 − α)β_j²}.

In the simplest case, differential penalization can be used to completely exclude a variable from penalization (by setting w_j := 0), offering a simple alternative to the more sophisticated approach of unpenalized adjustment discussed by Martinussen and Scheike (2009). Differential penalization also enables implementation of techniques such as the adaptive lasso (Zou, 2006).

Delayed entry
An approach which is common in, for example, epidemiological survival studies is to adjust for the age of study subjects by using it as the time axis in hazard regression models. This is popularly known as delayed entry (or left-truncation) and requires us to consider data of the form (S_1, T_1, δ_1), …, (S_n, T_n, δ_n) where 0 ≤ S_i < T_i is the entry time of the ith individual. By keeping N_i(t) = I(T_i ≤ t)δ_i but setting Y_i(t) = I(S_i ≤ t ≤ T_i), the regression models described in Section 2 extend straightforwardly to the delayed entry case.
Computer implementation of delayed entry is slightly more involved. The key observation is that the delayed entry at-risk indicator decomposes as Y_i(t) = I(t ≤ T_i) − I(t < S_i). Accordingly, define for i = 1, 2, …, n the following collection of 'pseudo observations': (T_i, δ_i, Z_i), entering all sums and integrals with weight +1; and (S_i, 0, Z_i), entering with weight −1. Hence we can deal with delayed entry by replacing the original n observations with 2n pseudo observations and using the algorithms developed for the case where S_1 = S_2 = ⋯ = S_n = 0.
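The decomposition underlying the pseudo observations is easy to verify (illustrative Python; the exact bookkeeping inside ahaz is not shown here):

```python
def atrisk_delayed(t, S, T):
    """At-risk indicator with delayed entry: Y_i(t) = I(S_i <= t <= T_i)."""
    return float(S <= t <= T)

def atrisk_pseudo(t, S, T):
    """Same indicator as a difference of two standard at-risk indicators,
    Y_i(t) = I(t <= T_i) - I(t < S_i): one pseudo observation entering
    positively and one negatively."""
    return float(t <= T) - float(t < S)
```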
The support for delayed entry is not only useful for implementing nonstandard time axes. It can also be used to implement (piecewise constant) time-varying explanatory variables, as well as to implement observations from more general counting processes.

Tuning parameter selection
A complete lasso or elastic net regularization path is useful mainly for judging the relative importance of variables. In practice, the experimenter typically seeks the solution for a single value of the regularization parameter λ which can then be used similarly to how a set of unpenalized regression coefficients would be used. To select a 'representative' value of λ, cross-validation is commonly employed. As argued in Martinussen and Scheike (2009), the loss function (5) for the additive hazards model can be interpreted as a 'prediction error' within a quite general setting. It follows that if F_1, …, F_K is a partition of {1, …, n}, each F_i being roughly the same size, we may define a cross-validation score

    CV(λ_l) := Σ_{i=1}^K L^{(F_i)}{β̂^{(−F_i)}(λ_l)},

with L^{(F_i)} the loss calculated using observations from F_i only, and β̂^{(−F_i)}(λ_l) the penalized regression coefficients based on observations in {1, …, n}\F_i only. We then select λ̂ := argmin_{λ_l} CV(λ_l) as the optimal λ-value.
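Schematically, the cross-validation score takes the following form (illustrative Python; `fit_fn` and `loss_fn` are our placeholders for the penalized fit and the loss (5) computed on a subset of observations):

```python
import numpy as np

def cv_score(loss_fn, fit_fn, folds, lam):
    """K-fold cross-validation score CV(lam) = sum_i L^{(F_i)}(betahat^{(-F_i)}(lam)).
    loss_fn(idx, beta): loss computed from observations in idx only;
    fit_fn(idx, lam):   penalized estimate based on observations in idx only."""
    all_idx = np.concatenate(folds)
    score = 0.0
    for F in folds:
        train = np.setdiff1d(all_idx, F)   # hold out fold F
        beta = fit_fn(train, lam)
        score += loss_fn(F, beta)
    return score
```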
In ahaz, 5-fold cross-validation (K = 5) is the default and offers an acceptable compromise between accuracy and stability in moderately sized data sets. In small data sets cross-validation can be somewhat unstable. For this reason, ahaz also supports repeated cross-validation where CV(λ) is averaged over several independent splits of {1, . . . , n} into folds.
It is also possible to select λ via criteria similar to BIC (or AIC). Although the loss function (5) is not based on a likelihood, we may still define the following analogue of BIC:

    PBIC(λ) := κ L{β̂(λ)} + df{β̂(λ)} f(n),    (16)

where κ is some scaling constant. A convenient estimate of df{β̂(λ)} is ‖β̂(λ)‖_0, the number of nonzero variables in β̂(λ) (Zou et al., 2007). Because the loss function L is of the least-squares type, the arguments of Wang and Leng (2007) can be used to show that for p fixed, if n^{−1}f(n) → 0 and f(n) → ∞, the choice λ̂ := argmin_λ PBIC(λ) entails certain selection consistency properties, depending on the underlying penalization method. For example, we can take f(n) := log n (Gorst-Rasmussen and Scheike, 2011). In ahaz, we use a scaling constant κ calculated from D, d, and an estimator B of the asymptotic covariance matrix of d, with all quantities evaluated within the set A of nonzero variables at the smallest value of λ used, and with X^− denoting the Moore-Penrose inverse of a matrix X. This choice of κ ensures that PBIC scales like a true BIC. Observe that (16), since it depends on the end point of the regularization path (through κ), is a sensible selection criterion primarily when p < n.

Timings and a data example
This section presents timing results for ahazpen, alongside an example of its usage on a real data set. We keep our timing study brief since previous work for other statistical models presents a strong case that well-designed coordinate descent algorithms are universally faster than competing lasso fitting methods (Simon et al., 2011).


Timings

Simon et al. (2011) used simulated data from a basic accelerated failure time model to assess runtimes of coordinate descent methods for the penalized Cox model. We adopt their simulation model for our runtime assessments and consider explanatory variables Z_i which are independent and identically distributed marginally standard Gaussian p-vectors satisfying Cor(Z_1j, Z_1k) = ρ for j ≠ k. True survival times are generated conditionally on the Z_i s as

    T̃_i := exp(Σ_{j=1}^p Z_ij β_j + W_i),

where β_j := (−1)^j exp(−2(j − 1)/20) and W_i is a mean zero Gaussian random variable with variance chosen such that the signal-to-noise ratio is 3.0. Censoring times are generated as C_i := exp(W̃_i), with W̃_i an independent copy of W_i, and the observed survival times as T_i := min(C_i, T̃_i).
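This simulation design can be sketched as follows (illustrative Python; the paper's timing study itself uses R, and the independence of the censoring noise is our reading of the design):

```python
import numpy as np

def simulate(n, p, rho, snr=3.0, seed=0):
    """Equicorrelated standard Gaussian covariates and log-linear survival times
    with signal-to-noise ratio snr (illustrative sketch of the timing design)."""
    rng = np.random.default_rng(seed)
    # Cor(Z_1j, Z_1k) = rho via a shared latent factor; marginal variance is 1
    u = rng.standard_normal((n, 1))
    Z = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    beta = np.array([(-1) ** j * np.exp(-2 * (j - 1) / 20) for j in range(1, p + 1)])
    lp = Z @ beta
    sigma = np.sqrt(lp.var() / snr)             # noise level giving the requested SNR
    Ttrue = np.exp(lp + sigma * rng.standard_normal(n))
    C = np.exp(sigma * rng.standard_normal(n))  # independent censoring times
    T = np.minimum(Ttrue, C)
    delta = (Ttrue <= C).astype(int)
    return T, delta, Z
```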
We compare the runtime of ahazpen with that of surv.lars from the R-package timereg (Scheike and Zhang, 2011) which is currently the only publicly available software for fitting the lasso penalized additive hazards model. The surv.lars function is a modified version of the lars function from the package lars and requires pre-calculation of the quantities D and d. We use a highly efficient C-routine for calculating D based on formulas as in Section 3.1 (function ahaz in the package ahaz). To make ahazpen and surv.lars reasonably comparable, we stop surv.lars after 100 steps, and use the corresponding smallest λ-value λ min to construct an exponentially decreasing λ-sequence for ahazpen of length 100 as in (10).
Experiments were run on an Intel Core i7 2.93 GHz, 8 GB RAM system with standard BLAS.
Runtimes for different values of n, p, and ρ are shown in Table 1 (averaged over 3 repetitions). Table 2 shows the corresponding runtimes of the pre-calculation part of lasso-LARS (averaged over the three repetitions and ρ as well). Coordinate descent is universally faster than lasso-LARS, especially for large values of p. The bottleneck of lasso-LARS is obviously the pre-calculation of D which has complexity of order O(np 2 ). Neither algorithm is much affected by large correlations. Table 3 shows runtimes of ahazpen for very large values of n or p (averaged over 3 repetitions), based on a path of 100 λ-values with λ min chosen such that the maximal number of variables in the path is roughly 100. It is seen that the algorithm is fully capable of dealing with large amounts of data. It runs more slowly for very large n than for very large p since the calculations in the strong/active sets become costly as well for large n. In our detailed assessments (not shown), the runtime scaled approximately linearly in n for fixed p and vice versa. Similar behavior was reported by Simon et al. (2011) for their coordinate descent algorithm for the penalized Cox model. Finally, a negative effect of large correlations on runtimes starts to become apparent for these large problems.
It is tempting to compare the raw computational performance of ahazpen with that of the coxnet function in glmnet for fitting lasso penalized Cox models (Simon et al., 2011; Friedman et al., 2010). Such a comparison can only be qualitative and superficial since the algorithms solve different problems, use different convergence criteria, and rely on completely independent implementations. Intuitively, one might expect ahazpen, which is based on a linear model, to be substantially faster than coxnet. This is not the case. In fact, our limited experiments with glmnet (version 1.7) suggest that the two methods often have surprisingly similar runtimes, both being roughly as fast as glmnet coordinate descent for the simple linear regression model on an equally sized problem. A plausible explanation is that comparatively little time is spent on nonlinear optimizations because of the efficient use of active-set calculations. On the other hand, for very large n, ahazpen can be more efficient than coxnet since the coordinate descent part of ahazpen is essentially 'kernelized' (via the use of D and d), whereas coxnet does coordinate descent via inner products between n-vectors. Also, cross-validation tends to be somewhat faster for ahazpen than for coxnet since it is based on the simple quadratic loss (5).
An appreciable and implementation-independent advantage of the linearity of the additive hazards model is the guaranteed convergence of ahazpen. It is also our experience that the runtime of ahazpen is more predictable than that of coxnet, where convergence of the nonlinear optimization part can be sensitive to the nature of the data considered.

An example using real data
To demonstrate the practical use of ahazpen, we consider the Sørlie data set (Sørlie et al., 2003) which consists of 549 gene expression measurements and survival times for 115 women diagnosed with breast cancer. This data set was also used by Martinussen and Scheike (2009) to demonstrate the lasso penalized additive hazards model.
We consider the challenging problem of performing lasso penalized survival regression for both main effects and pairwise (multiplicative) interactions of gene expressions. The design matrix has p = 549 + 549·(549 − 1)/2 = 150,975 columns. We apply the additive hazards lasso directly to this design matrix, ignoring here the discussion of whether it is sensible to allow for inclusion of interactions without the corresponding main effects.

Table 3: Runtime (seconds) for ahazpen averaged over 3 repetitions.
We load and format the data as follows (note that generating X may take several minutes):

R> data("sorlie")
R> set.seed(10101)
R> surv <- Surv(sorlie$time + runif(nrow(sorlie)) * 1e-2, sorlie$status)
R> Z <- sorlie[, 3:ncol(sorlie)]; p <- ncol(Z)
R> pw.comb <- combn(1:p, 2)
R> X <- cbind(Z, Z[, pw.comb[1, ]] * Z[, pw.comb[2, ]])

It is common practice to put variables on the same scale before applying the lasso. In ahazpen, data is scaled by default (estimates are returned on the original scale) so it is not necessary to standardize data manually. We make the following call to ahazpen to fit the lasso; the choice of penalty corresponds to the default value and is included here for completeness:

R> fit.init <- ahazpen(surv, X, dfmax = 50, penalty = lasso.control(alpha = 1))

To prevent ahazpen from calculating a complete regularization path, dfmax has been specified. This option is useful for reducing computation time since, in practice, the lasso often prefers rather sparse solutions. Only 32 λ-values are used even though ahazpen is set to use 100 λ-values by default. This is because ahazpen cannot anticipate the λ-value at which dfmax is reached and hence simply truncates the default λ-sequence (10). A grid of λ-values with the desired density is easily obtained by a second call to ahazpen:

R> l <- range(fit.init$lambda)
R> fit <- ahazpen(surv, X, lambda.minf = l[1] / l[2])
R> plot(fit)

Evaluating this ahazpen call took roughly 9 seconds. A plot of the regularization path is shown in Figure 1 (left).
To determine an optimal value of λ, we use 5-fold cross-validation as follows:

R> fit.tune <- tune.ahazpen(surv, X, dfmax = 50)
R> plot(fit.tune)

Typically, K-fold cross-validation takes about as long as running ahazpen K + 1 times. Figure 1 (right) shows the curve of cross-validation scores. Indices of the final nonzero regression coefficients are then obtained as follows:

R> beta <- coef(fit.tune)
R> which(as.numeric(beta) != 0)
[1]  21 269 346 401

Apparently, the lasso prefers a model containing main effects only.

Discussion
Cyclic coordinate descent is a simple numerical optimization method which works exceptionally well for penalized regression problems with variable selection. We have developed a coordinate descent algorithm for the elastic net penalized additive hazards model and provided an implementation via the ahazpen function in the R-package ahaz. This function can handle very large amounts of data efficiently and is an important and flexible alternative to the more commonly used elastic net penalized Cox model. In terms of computational properties, the additive hazards model is intrinsically linear, which implies theoretically guaranteed convergence of ahazpen and highly predictable runtimes in practice. Our specific implementation provides support for survival data with delayed entry, which in turn enables the use of more complex data types such as nonstandard time axes, time-varying covariates, and general counting process data.