High-dimensional additive hazard models and the Lasso

We consider a general high-dimensional additive hazard model in a non-asymptotic setting, including regression for censored data. In this context, we consider a Lasso estimator with a fully data-driven $\ell_1$ penalization, which is tuned for the estimation problem at hand. We prove sharp oracle inequalities for this estimator. Our analysis involves a new "data-driven" Bernstein's inequality, of independent interest, in which the predictable variation is replaced by the optional variation.


Introduction
Interest has grown recently in connecting gene expression profiles to patients' survival times, see e.g. [30,34], where the aim is to assess the influence of gene expressions on the survival outcomes. The statistical analysis of such data faces two sorts of problems. First, the covariates are high-dimensional: the number of covariates is much larger than the number of observations. Second, the survival outcomes suffer from censoring, truncation, etc. The need for proper statistical methods to analyze such data, in particular high-dimensional right-censored data, has led in the past years to numerous theoretical and computational contributions.
When the survival times suffer from right-censoring, the problem can be presented as follows. For an individual $i \in \{1, \dots, n\}$, let $T_i$ be the time of interest (e.g. the patient's survival time), let $C_i$ be the censoring time and $X_i$ be the vector of covariates in $\mathbb{R}^d$, assumed to be independent copies of $T$, $C$ and $X = (X_1, \dots, X_d)$. We observe $Z_i = T_i \wedge C_i$, $\delta_i = \mathbf{1}(T_i \leq C_i)$ and $X_i$ for $i = 1, \dots, n$.
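For concreteness, this observation scheme can be simulated as in the following sketch; the exponential survival and censoring distributions, the sample sizes and the sparse vector $\beta_0$ are illustrative assumptions, not quantities taken from the data discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500  # high-dimensional setting: many more covariates than observations

# Covariates and a sparse vector beta_0 (all values below are illustrative assumptions)
X = rng.normal(size=(n, d))
beta0 = np.zeros(d)
beta0[:5] = 0.5

# Survival times with a constant baseline hazard lam0 and an additive covariate effect,
# so that T_i given X_i is exponential with rate lam0 + X_i^T beta_0.
lam0 = 1.0
rate = lam0 + np.clip(X @ beta0, 0.0, None)   # keep the conditional hazard nonnegative (simplification of this sketch)
T = rng.exponential(scale=1.0 / rate)
C = rng.exponential(scale=1.0, size=n)        # independent censoring times

Z = np.minimum(T, C)            # observed times Z_i = T_i ^ C_i
delta = (T <= C).astype(int)    # censoring indicators delta_i = 1(T_i <= C_i)
```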
The covariate vector $X$, where both genomic outcomes and clinical data may be recorded, is in high dimension $d \gg n$ and influences the distribution of $T$ via its conditional hazard rate given $X = x$, defined by $\alpha_0(t|x) = f_{T|X}(t|x) / (1 - F_{T|X}(t|x))$ for $t > 0$, where $f_{T|X}$ and $F_{T|X}$ are respectively the conditional density and distribution functions of $T$ given $X = x$. In the following, we assume that the conditional hazard fulfills the Aalen additive hazards model [1]: $\alpha_0(t|x) = \lambda_0(t) + \beta_0^\top x$, where $\lambda_0$ is the baseline hazard function and $\beta_0$ measures the influence of the covariates on the conditional hazard function $\alpha_0$. In [21], an additive hazards model is fitted to investigate the influence of the expression levels of 8810 genes on the (censored) survival times of 92 patients suffering from Mantle Cell Lymphoma, see [30] for the data. The Aalen additive hazards model is indeed a useful alternative to the Cox model [10], in particular in situations where the proportional hazards assumption is violated. It can also "be seen as a first-order Taylor series expansion of a general intensity" (see [23], p. 103).
When the aim is to understand the influence of $X$ on the survival time $T$, one wants to estimate $\beta_0$ based on the observations. In small dimension $d \ll n$ and from the data $(Z_i, \delta_i, X_i)_{i=1,\dots,n}$, the least-squares estimator $\hat\beta$ of the unknown $\beta_0$ is the minimizer of a quadratic functional of the form $\beta \mapsto \beta^\top H_n \beta - 2 h_n^\top \beta$, where $H_n$ is a $d \times d$ symmetrical positive semidefinite matrix and $h_n$ a vector in $\mathbb{R}^d$, both computed from these data.
When $d \leq n$ and if $H_n$ is full rank, we can write $\hat\beta = H_n^{-1} h_n$, see also [19] or [25]. The estimator $\hat\beta$ is $\sqrt{n}$-consistent and asymptotically Gaussian, see e.g. [2]. When $X$ contains genomic outcomes, one typically has $d \gg n$, and the matrix $H_n$ is no longer of full rank. A sparsity assumption is then natural in this setting: we expect only a few genes to have an influence on the survival times, so we expect $\beta_0$ to be sparse, which means that it has only a few non-zero coordinates. Several papers use sparsity-inducing penalization in the context of survival analysis, mainly for the Cox multiplicative risks model or the Aalen additive risks model; we refer to [35] for a review. Most procedures are based on $\ell_1$-penalization, where one considers
$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \Big\{ \beta^\top H_n \beta - 2 h_n^\top \beta + \lambda \sum_{j=1}^d w_j |\beta_j| \Big\}. \qquad (1)$$
The smoothing parameter $\lambda > 0$ makes the balance between goodness-of-fit and sparsity, and the $w_j \geq 0$, $j = 1, \dots, d$, are weights allowing for a precise tuning of the penalization. The Lasso penalization corresponds to the simple choice $w_j = 1$, while in the adaptive Lasso [38] one chooses $w_j = |\tilde\beta_j|^{-\gamma}$, where $\tilde\beta_j$ is a preliminary estimator and $\gamma > 0$ a constant. The idea behind this is to correct the bias of the Lasso in terms of variable selection accuracy, see [38] and [37] for regression analysis. The weights $w_j$ can also be used to scale each variable at the same level, which is suitable when some variables have a much larger variance than the others. As a by-product of the theoretical analysis given in this paper, we introduce a new way of scaling the variables, using data-driven weights $\hat w_j$ in the $\ell_1$ penalization, see (14) below.
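Since the displayed entries of $H_n$ and $h_n$ do not appear above, the following sketch assumes the classical Lin–Ying form of the least-squares criterion for censored data, with covariates centered by their at-risk average, and discretizes the time integral on a grid; it illustrates how $\hat\beta = H_n^{-1} h_n$ can be computed in the low-dimensional case and should not be read as the paper's exact definitions.

```python
import numpy as np

def additive_hazards_ls(Z, delta, X, grid_size=200):
    """Approximate H_n and h_n for the additive hazards least-squares criterion.

    Assumes the classical Lin-Ying form, in which covariates are centered by their
    at-risk average; the time integral is discretized on a regular grid, which is
    a simplification of this sketch, not the paper's definition.
    """
    n, d = X.shape
    grid = np.linspace(0.0, Z.max(), grid_size)
    dt = grid[1] - grid[0]

    H = np.zeros((d, d))
    for t in grid:
        at_risk = Z >= t                               # Y_i(t) = 1(Z_i >= t)
        if not at_risk.any():
            break
        Xc = X[at_risk] - X[at_risk].mean(axis=0)      # center by the at-risk average
        H += Xc.T @ Xc * dt
    H /= n

    # dN_i has a single jump, located at Z_i, when the observation is uncensored
    centered_at_event = np.array([X[i] - X[Z >= Z[i]].mean(axis=0) for i in range(n)])
    h = (delta[:, None] * centered_at_event).sum(axis=0) / n
    return H, h

# Low-dimensional case (d <= n, H_n invertible): the least-squares estimator
# H, h = additive_hazards_ls(Z, delta, X)
# beta_hat = np.linalg.solve(H, h)
```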
For the additive risks model, [22] considers principal component regression, [21] considers a Lasso with a least-squares criterion that differs from the one considered here, [18,25] consider the ridge, Lasso and adaptive Lasso penalizations, and [24] considers the partial least-squares and ridge regression estimators.
A serious computational advantage of additive risks over multiplicative risks has to be highlighted. Indeed, for additive risks, the estimating Equation (1) has a least-squares form, so that one can apply in this case the fast LARS algorithm [11] in order to obtain the whole path of solutions of the Lasso, as explained in [18] for instance. This point is particularly relevant in practice, since one typically uses splitting techniques, such as cross-validation, to select the smoothing parameter, or ensemble methods, such as stability selection [27], to select covariates. The motivations and main contributions of this work are enumerated in the following.
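Before turning to the motivations, here is one way of making this computational remark concrete: factor $H_n$ as $A^\top A$ and feed the resulting pseudo-design to a standard LARS solver. The factorization trick, the small ridge term and the scikit-learn call are conveniences of this sketch, not steps prescribed by the paper.

```python
import numpy as np
from sklearn.linear_model import lars_path

def lasso_path_from_quadratic(H, h, eps=1e-8):
    """Lasso path for a criterion of the form b^T H b - 2 h^T b + penalty.

    The quadratic part is rewritten as an ordinary least-squares fit |A b - y|_2^2
    (up to an additive constant) with H = A^T A and A^T y = h, so that a standard
    LARS implementation can be reused.  The small ridge term eps (added so that the
    Cholesky factor exists) is a convenience of this sketch.
    """
    d = H.shape[0]
    A = np.linalg.cholesky(H + eps * np.eye(d)).T   # H ~= A^T A, with A upper triangular
    y = np.linalg.solve(A.T, h)                     # solves A^T y = h
    alphas, _, coefs = lars_path(A, y, method="lasso")
    return alphas, coefs                            # coefs[:, k] is the solution at level alphas[k]

# Weights w_j can be handled by rescaling column j of A by 1/w_j and mapping the
# solution back via beta_j = gamma_j / w_j.
# Example (illustrative): alphas, coefs = lasso_path_from_quadratic(H, h)
```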
First motivation. In the papers that propose a mathematical analysis of the statistical properties of estimators of the form (1) (upper bounds, support recovery, etc.), the results are asymptotic in the number of observations. This can be a problem since, in practice, one cannot, in general, consider that the asymptotic regime has been reached: in [30], for example, the expression levels of 8810 genes and survival information are measured for only 92 patients. Considering only the references that are the closest to the work proposed here, the oracle property for the adaptive Lasso is given in [18], which is an asymptotic property about the support and the asymptotic distribution of the estimator, and asymptotic normality and consistency in variable selection for the adaptive Lasso are proved in [25], where results about the Dantzig selector are also derived using the restricted isometry property and the uniform uncertainty principle from [8]. While non-asymptotic results, like sparse oracle inequalities for instance, are now well-known for regression or density estimation (see for instance [7], [5], [4], among many others), such results are not yet available for survival data. In this paper, we establish the first results of this kind for survival analysis.
Second motivation. We give sharp oracle inequalities (with leading constant 1) for the prediction error associated with the survival problem. The results are stated for general counting processes, including the censoring case, while most papers consider censored data only. Our results are stated without the assumption that the intensity is linear in the covariates. In fact, our Lasso estimator can be computed using an arbitrary dictionary of functions, so that one can expect a better approximation of the true underlying intensity.
Third motivation. In order to prove our results, we need a new version of Bernstein's inequality for martingales with jumps, where the predictable variation, which is not observable in this problem, is replaced by the optional variation, which is observable. This concentration inequality is of independent interest, and could be useful for other statistical problems as well.
Fourth motivation. Finally, and more importantly, our non-asymptotic analysis leads to an adaptive data-driven weighting of the $\ell_1$-norm, that involves the optional variation of each element of the dictionary (or of each covariate in the linear case). More precisely, our sharp control of the noise term exhibits the fact that the $\ell_1$-penalization (see (1)) should be scaled using data-driven weights $\hat w_j$ whose dominating factor corresponds, roughly, to an estimate of the variance of variable $j$ (see Section 3 for details). Hence, our theoretical analysis exhibits a new way of tuning the $\ell_1$ penalization, by weighting each coordinate with this empirical variance term, in order to balance out possible differences between the variabilities of the $X_j$, $j = 1, \dots, d$. This particular form of weighting, or scaling of the variables, was not previously noticed in the literature.

The paper is organized as follows. Section 2 describes the model. The Lasso estimator is constructed in Section 3. Oracle inequalities for the Lasso are given in Section 4, see Theorems 1 and 2. Some details about the construction of the least-squares criterion are given in Section 6.1. The data-driven Bernstein's inequality is stated in Section 5, see Theorem 3, and the proofs of our results are given in Section 6.

High-dimensional Aalen model
Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space and $(\mathcal F_t)_{t \geq 0}$ a filtration satisfying the usual conditions: increasing, right-continuous and complete (see [14]). Let $N$ be a marked counting process with compensator $\Lambda$ with respect to $(\mathcal F_t)_{t \geq 0}$, so that $M = N - \Lambda$ is an $(\mathcal F_t)_{t \geq 0}$-martingale.
We assume that $N$ is a marked point process satisfying the Aalen multiplicative intensity model. This means that $\Lambda$ writes $\Lambda_t = \int_0^t \alpha_0(s, X) Y_s \, ds$ for all $t \geq 0$, where: • the function $\alpha_0$, called the intensity, is unknown, deterministic and nonnegative; • $Y$ is a predictable random process taking values in $[0, 1]$; here $X \in \mathbb{R}^d$ denotes the mark of the process, namely the vector of covariates.
With differential notations, this model can be written as $dN_t = \alpha_0(t, X) Y_t \, dt + dM_t$ for all $t \geq 0$, with the same notations as before, and taking $N_0 = 0$. Now, assume that we observe $n$ i.i.d. copies $D_n = \{ (X_i, N^i_t, Y^i_t) : t \in [0, \tau], \ i = 1, \dots, n \}$, where $\tau$ is the end-point of the study. Without loss of generality, we set $\tau = 1$. We can write $dN^i_t = \alpha_0(t, X_i) Y^i_t \, dt + dM^i_t$ for $i = 1, \dots, n$. In this setting, the random variable $N^i_t$ is the number of observed failures of individual $i$ during the time interval $[0, t]$. This model encompasses several particular examples: censored data, marked Poisson processes and Markov processes, see e.g. [2] for a precise exposition. In the censored case, described in the Introduction, the random processes in $D_n$, see Equation (4), are given by $N^i_t = \mathbf 1(Z_i \leq t, \ \delta_i = 1)$ and $Y^i_t = \mathbf 1(Z_i \geq t)$ for $i = 1, \dots, n$ and $0 \leq t \leq 1$.
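In the censored case these two processes are simple step functions of the observed pair $(Z_i, \delta_i)$; the small sketch below evaluates them under this standard right-censoring transcription.

```python
def counting_and_at_risk(Z_i, delta_i, t):
    """Evaluate N_i(t) = 1(Z_i <= t, delta_i = 1) and Y_i(t) = 1(Z_i >= t),
    the right-censoring transcription recalled above."""
    N = 1.0 if (Z_i <= t and delta_i == 1) else 0.0
    Y = 1.0 if Z_i >= t else 0.0
    return N, Y

# Illustrative values: an uncensored observation at Z_i = 0.4
# counting_and_at_risk(0.4, 1, 0.2)  # -> (0.0, 1.0): no event yet, still at risk
# counting_and_at_risk(0.4, 1, 0.7)  # -> (1.0, 0.0): event counted, no longer at risk
```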
In this paper, we assume that the intensity function satisfies the Aalen additive model, in the sense that it writes $\alpha_0(t, x) = \lambda_0(t) + h_0(x)$, where $\lambda_0 : \mathbb{R}_+ \to \mathbb{R}_+$ is a nonparametric baseline intensity and $h_0 : \mathbb{R}^d \to \mathbb{R}_+$. Note that in the "usual" Aalen additive model, see [19,26,24,25], the function $h_0$ is linear: $h_0(x) = \beta_0^\top x$, where $\beta_0$ is an unknown vector in $\mathbb{R}^d$. The aim of the paper is to recover the function $h_0$ based on the observation of the sample $D_n$.

A least-squares type functional
The problem considered here is a regression problem: we want to explain the influence of the covariates $X_i$ on the survival data $N^i$ and $Y^i$. Namely, we want to make inference on $h_0$, while the baseline function $\lambda_0$ is considered as a nuisance parameter. Thanks to the additive structure (5), we can construct an estimator of $h_0$ without any estimation of $\lambda_0$, so that the influence of the covariates on the survival data can be inferred without any knowledge of $\lambda_0$. This classical principle leads to the construction of the partial likelihood in the Cox model (multiplicative risks, see [10]) and to the construction of the "partial" least-squares (in reference to the partial likelihood) for the additive risks, see [19], which is the one considered here. The "partial least-squares" criterion (6) for a "covariate" function $h : \mathbb{R}^d \to \mathbb{R}_+$ was first introduced in [19]. The main steps leading to (6) are described in Section 6.1 below, where we explain why it is indeed suitable for the estimation of $h_0$ (see in particular Equation (20)). Now, we consider a set $\mathcal H = \{h_1, \dots, h_M\}$ of functions, called a dictionary. The set $\mathcal H$ can be a collection of basis functions that can approximate the unknown $h$, like wavelets, splines, kernels, etc. They can also be estimators computed using an independent training sample, like several estimators computed using different tuning parameters, leading to the so-called aggregation problem, see [6] for instance. Implicitly, it is assumed that the unknown $h_0$ is well-approximated by a linear combination $h_\beta = \sum_{j=1}^M \beta_j h_j$, where $\beta \in \mathbb{R}^M$ is to be estimated. However, note that we won't assume, for the statements of our results, that the unknown $h_0$ is equal to $h_{\beta_0}$ for some unknown $\beta_0 \in \mathbb{R}^M$, hence allowing for a model bias. Note that the setting considered here includes the linear case, obtained by taking $M = d$ and $h_j(x) = x_j$. We define the least-squares risk $R_n(\beta)$ of $\beta \in \mathbb{R}^M$ as the value of the functional (6) at $h_\beta$, using (7). Note that (9) is a least-squares criterion, since it can be written as $R_n(\beta) = \beta^\top H_n \beta - 2 h_n^\top \beta$, where $H_n$ is an $M \times M$ symmetrical positive semidefinite matrix and $h_n$ is a vector in $\mathbb{R}^M$, both computed from the data $D_n$. Since $H_n$ is symmetrical positive semidefinite, we can take a square root of it; here $|x|_2$ stands for the $\ell_2$-norm and, more generally, we denote by $|x|_p$ the $\ell_p$-norm of a vector $x$.

$\ell_1$-penalization for the Aalen model
For the problem considered here, we have seen that the empirical risk $R_n$ has to be chosen with care. This is also the case for the $\ell_1$ penalization to be used for this problem. Namely, for a well-chosen sequence of positive data-driven weights $\hat w = (\hat w_1, \dots, \hat w_M)$, we consider the weighted $\ell_1$-norm $\sum_{j=1}^M \hat w_j |\beta_j|$ as penalization $\mathrm{pen}(\beta)$, and choose $\hat\beta$ according to the penalized criterion $\hat\beta \in \operatorname*{argmin}_{\beta \in B} \{ R_n(\beta) + \mathrm{pen}(\beta) \}$, where $B$ is an arbitrary convex set (typically $B = \mathbb{R}^M$ or $B = \mathbb{R}^M_+$, the latter making $h_{\hat\beta}$ non-negative). The weights considered in (13) are given by $\hat w_j = \hat w(h_j)$ (where we recall that $h_j \in \mathcal H$), where for any function $h$ we take $\hat w(h)$ as given by (14), which combines: • a term $\hat V(h)$ corresponding to the "observable empirical variance" of $h$ (see below for details); • a small technical term $\ell_{n,x}(h)$ coming out of our analysis. Note that the weights $\hat w_j$ are fully data-driven. The shape of these weights comes from a new empirical Bernstein's inequality involving the optional variation of the noise process of the model, see Theorem 3 in Section 5 below. The penalization (12) is tuned for the estimation problem at hand. It uses the estimator $\hat V(h)$ of the (unobservable) predictable quadratic variation and it does not depend on a uniform upper bound for $h$. As a consequence, it can give, from a practical point of view, some insight into the tuning of the $\ell_1$-penalization. In particular, our analysis proves that the $j$-th coordinate of $\beta$ in the $\ell_1$ penalization should be rescaled by $\hat V(h_j)^{1/2}$; a sketch of this scaling is given below. Note that this was not previously noticed in the literature, in part because most results are stated from an asymptotic point of view, see the references mentioned in the Introduction.
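As an illustration of this scaling, the sketch below keeps only the dominating term of such weights: an empirical-variance factor for each dictionary element, multiplied by a $\sqrt{(x + \log M)/n}$ level. The normalization chosen for the variance estimate, the constant and the omission of the small correction term are assumptions of this sketch, not the exact definition (14).

```python
import numpy as np

def data_driven_weights(dict_values, delta, x=1.0):
    """Dominating term of data-driven weights of the kind discussed above (illustration only).

    dict_values: (n, M) array with entries h_j(X_i); in the linear case, the covariates.
    The 'observable empirical variance' of h_j is approximated here by a jump-based
    quantity, (1/n) * sum_i delta_i * h_j(X_i)**2; the exact definition (14) also
    involves numerical constants and a small log log correction that are omitted.
    """
    n, M = dict_values.shape
    V_hat = (delta[:, None] * dict_values ** 2).mean(axis=0)   # one possible normalization
    return np.sqrt(2.0 * (x + np.log(M)) * V_hat / n)

# Linear case: the dictionary is the set of coordinates, so dict_values = X and the
# weighted l1 penalty reads sum_j w_hat[j] * abs(beta[j]).
# w_hat = data_driven_weights(X, delta)
```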
Below are two oracle inequalities for $h_{\hat\beta}$. The first one (Theorem 1) is a "slow" oracle inequality, with a rate of order $(\log M / n)^{1/2}$, which holds without any assumption on the Gram matrix $G_n$. The second one (Theorem 2) is an oracle inequality with a fast rate of order $\log M / n$, that holds under an assumption on the restricted eigenvalues of $G_n$. Theorem 1. Let $x > 0$ be fixed, and let $\hat h = h_{\hat\beta}$, where $\hat\beta \in \operatorname*{argmin}_{\beta \in B} \{ R_n(\beta) + \mathrm{pen}(\beta) \}$ with $\mathrm{pen}(b)$ given by (12). Then we have, with a probability larger than $1 - 29 e^{-x}$, a sharp oracle inequality in which $\| h_{\hat\beta} - h_0 \|_n^2$ is bounded by the infimum over $\beta \in B$ of $\| h_\beta - h_0 \|_n^2$ plus a term of order $\mathrm{pen}(\beta)$. Note that the dominant term in $\mathrm{pen}(\beta)$ is, up to the slow $\log\log$ term, of order $|\beta|_1 \sqrt{\log M / n}$, which is the expected slow rate for $\hat h$ involving the $\ell_1$-norm (see [5] for the regression model and [7,4] for density estimation). For the proof of oracle inequalities with a fast $\log M / n$ rate, the restricted eigenvalue (RE) condition introduced in [5] and [15,16] is of importance. Restricted eigenvalue conditions are implied by, and in general weaker than, the so-called incoherence or RIP assumptions, which exclude strong correlations between covariates. This condition is acknowledged to be one of the weakest under which fast rates for the Lasso can be derived. One can find in [33] an exhaustive survey and comparison of the assumptions used to prove fast oracle inequalities for the Lasso, where the so-called "compatibility condition", which is slightly more general than RE, is described.
The restricted eigenvalue condition is defined below. Note that our presentation (and the arguments used in the proof of Theorem 2) is close to [17], where oracle inequalities for the matrix Lasso are given. Let us introduce, for any $\beta \in \mathbb{R}^M$ and $c_0 > 0$, the cone $C_{\beta, c_0} = \{ b \in \mathbb{R}^M : |b_{J(\beta)^c}|_1 \leq c_0 |b_{J(\beta)}|_1 \}$, where $J(\beta)$ denotes the support of $\beta$. The cone $C_{\beta, c_0}$ consists of vectors that have a support close to the support of $\beta$. Then, introduce the quantity $\mu_{c_0}(\beta)$: the number $1/\mu_{c_0}(\beta)$ is a uniform lower bound for $|G_n b|_2 / |b_{J(\beta)}|_2$ over $b \in C_{\beta, c_0}$. Hence, it is a lower bound for "eigenvalues" restricted to vectors with a support close to the support of $\beta$. Also, note that $c \mapsto \mu_c(\beta)$ is non-increasing. Now, let us state Theorem 2, which holds under the restricted eigenvalue condition. Theorem 2. Let $x > 0$ be fixed and let $\hat h = h_{\hat\beta}$, with $\hat\beta$ defined as in Theorem 1 and $\mathrm{pen}(b)$ given by (12). Then we have, with a probability larger than $1 - 29 e^{-x}$, a sparse oracle inequality whose dominant term is (up to the $\log\log$ term) of order $|\beta|_0 \log M / n$. This is the fast rate to be found in sparse oracle inequalities [5,15,8]. Moreover, note that the (sparse) oracle inequality in Theorem 2 is sharp, in the sense that there is a constant 1 in front of the oracle term $\inf_{\beta \in B} \| h_\beta - h_0 \|_n^2$, see Remark 2 below. Remark 1. Note that the constant $c_0 = 3$ (for $\mu_{c_0}(\beta)$) is used in Theorem 2. This is because, with a large probability, $\hat\beta - \beta$ belongs to the cone $C_{\beta, 3}$. Such an argument of cone constraint is at the core of the convex analysis underlying the proof of fast oracle inequalities for the Lasso, see for instance [8,5,17].
Remark 2. We were able to prove a sharp sparse oracle inequality (with leading constant 1) because we adapted to our context a recent argument from [17], which uses some tools from convex analysis (such as the fact that the subdifferential mapping is monotone, see [29]) in the study of $\hat\beta$ as a minimizer of the convex functional $R_n + \mathrm{pen}$.

An empirical Bernstein's inequality
The proofs of Theorems 1 and 2 require a sharp control of the "noise term" arising from model (3). For a fixed function $h$, this noise term is a stochastic process $Z_t(h)$ which is a martingale with jumps; the jumps of the counting processes are of size $+1$, since we assume the existence of the intensity function $\alpha_0$, see (2). In order to give an upper bound on $|Z_t(h)|$ that holds with a large probability, one can use Bernstein's inequality for martingales with jumps, see [20] (a proof of this fact is implicit in the proof of Theorem 3, see Section 6 below). Applied to the process $Z_t(h)$, it gives a deviation bound involving $V_t(h)$, the predictable variation of $Z_t(h)$, which will also be referred to as the variance term. But, since the term $V_t(h)$ depends explicitly on the unknown intensity $\alpha_0$, one cannot use it in the penalizing term of the Lasso estimator. Moreover, this result is stated on the event $\{V_t \leq v\}$, while we would like an inequality that holds in general. Hence, we need a new Bernstein-type inequality that uses an observable empirical variance term instead of $V_t(h)$. We prove in Theorem 3 below that we can replace $V_t(h)$ by the optional variation of $Z_t(h)$, which can also be seen as an estimator of $V_t(h)$. Moreover, our result holds in general, and not only on $\{V_t(h) \leq v\}$. The counterpart is the presence of a small $\log\log$ term in the upper bound for $|Z_t(h)|$.
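For orientation, the classical inequality alluded to here can be written in the following generic form for a martingale with jumps bounded by 1; the exact normalization used for $Z_t(h)$ in the paper is not reproduced, so this should be read as a reminder of the shape of the bound rather than as its statement.

```latex
% Classical Bernstein inequality for a martingale Z with jumps bounded by 1 and
% predictable variation V_t (the inequality referred to above, see [20]):
% for any fixed v, x > 0,
\[
  \mathbb{P}\Big( Z_t \ge \sqrt{2 v x} + \tfrac{x}{3} \ \text{ and } \ V_t \le v \Big) \le e^{-x}.
\]
% The bound mixes a sub-Gaussian term sqrt(2 v x), driven by the variance, and a
% sub-exponential term x / 3, driven by the jump size; it is stated on the event
% {V_t <= v}, which is precisely the restriction that Theorem 3 removes.
```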
Theorem 3. For any numerical constants $c_\ell > 1$, $\epsilon > 0$ and $c_0 > 0$ such that $e c_0 > 2(4/3 + \epsilon) c_\ell$, a deviation inequality of the form (18) holds for any $x > 0$: with a large probability, $|Z_t(h)|$ is bounded by an observable quantity built from the optional variation of $Z_t(h)$, up to a small $\log\log$ correction. By choosing $c_\ell = 2$, $\epsilon = 1$ and $c_0 = 4(4/3 + \epsilon) c_\ell / e = 56/(3e)$, Inequality (18) holds with explicit numerical values. The concentration inequality (18) is fully data-driven, since the random variable that upper bounds $|Z_t(h)|$ with a large probability is observable. Note that the numerical values given in Theorem 3 are the ones used in the construction of the $\ell_1$-penalization (12). These are chosen for the sake of simplicity, but other combinations of numerical values can be considered as well.
The idea of using Bernstein's deviation inequality with an estimated variance is of importance for statistical problems. In [4] for instance, a Bernstein's inequality with empirical variance is derived in order to study the Dantzig selector for density estimation. Note, however, that we are not aware of a previous result such as Theorem 3 for continuous-time martingales with jumps, except for a work in progress [13], which uses a similar concentration inequality in the context of point processes.

Decomposition of the least-squares
In this section, we give the details of the construction of the partial least-squares criterion (6). It is based on the decomposition, using the additive structure (5), of the least-squares risk for counting processes depending on covariates, see for instance [28] and [9]. In model (3), on the basis of the observations (4), the least-squares functional to be considered for the estimation of $\alpha_0$ is given by
$$L_n(\alpha) = \frac{1}{n} \sum_{i=1}^n \Big( \int_0^1 \alpha(t, X_i)^2 Y^i_t \, dt - 2 \int_0^1 \alpha(t, X_i) \, dN^i_t \Big),$$
where $\alpha : [0, 1] \times \mathbb{R}^d \to \mathbb{R}_+$. Using the additive structure (5), we can decompose $L_n$ into three terms, $L_n = L_{n,1} + L_{n,2} + L_{n,3}$ (19), where, with the quantities introduced in Section 3, the term $L_{n,2}$ only involves the "covariate" part $h$. Now, the point is that, according to Lemma 1 below, the term $L_{n,3}$ is zero.
Lemma 1, whose statement holds for any function $h : \mathbb{R}^d \to \mathbb{R}_+$ and any function $\varphi$ defined on $[0, 1]$, follows from an easy computation which is omitted. The term $L_{n,2}$ in (19) is the partial least-squares criterion considered in Section 3, see Equation (6). We now explain why it is suitable for the estimation of $h_0$. If the Aalen additive model holds, we have $dN^i_t = (\lambda_0(t) + h_0(X_i)) Y^i_t \, dt + dM^i_t$ for all $i = 1, \dots, n$; so, using again Lemma 1 and the empirical norm $\| \cdot \|_n^2$ defined in Equation (15) (see Section 3 above), we arrive at the decomposition (20). The last term in the right-hand side of (20) is a noise term, with tails controlled in Section 5 above. It is now understood that finding a minimizer of $L_{n,2}$, or a penalized version of it, is a natural way of estimating $h_0$. We refer the reader to [25] for another justification of the "partial least-squares" criterion in the linear case $h_0(x) = x^\top \beta_0$.

Proof of Theorem 3
For $i = 1, \dots, n$, the processes $N^i$ are i.i.d. counting processes satisfying the Doob–Meyer decomposition $N^i = \Lambda^i + M^i$. Note that $|H^i_t| \leq 1$. Since $H^i$ is predictable and bounded, the process $U$ is a square integrable martingale, as a sum of square integrable martingales. Its predictable variation $\langle U \rangle$ involves the integrals of $(H^i)^2$ against the compensators $\Lambda^i$, while its optional variation $[U]$ is obtained by replacing the compensators $\Lambda^i$ by the counting processes $N^i$.
From [32], we know that (21) is a supermartingale provided that $S_\lambda$ is chosen as the compensator of an appropriate jump process; we now derive the expression of $S_\lambda$. The process $E$ can also be rewritten, and bounded, in terms of the jumps of the processes $N^i$, where the bound holds almost surely, since the $M^i$ are independent, hence do not jump at the same time (with probability 1). The fact that (21) is a supermartingale then entails, via Markov's inequality, a deviation bound valid for any $\lambda, x > 0$. The following facts hold true: • $e^\lambda - \lambda - 1 \leq \frac{\lambda^2}{2(1 - \lambda/3)}$ for any $\lambda \in (0, 3)$; • $\min_{\lambda \in (0, 1/b)} \big( \frac{a \lambda}{1 - b \lambda} + \frac{x}{\lambda} \big) = 2 \sqrt{a x} + b x$, for any $a, b, x > 0$.
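The second elementary fact can be checked directly; a short verification, under the stated constraint $\lambda \in (0, 1/b)$, reads as follows.

```latex
% Verification of the second fact: the minimum over 0 < lambda < 1/b of
% a*lambda/(1 - b*lambda) + x/lambda equals 2*sqrt(a x) + b x.
% The first-order condition a/(1 - b\lambda)^2 = x/\lambda^2 gives
% \sqrt{a}\,\lambda = \sqrt{x}\,(1 - b\lambda), i.e. \lambda_* = \sqrt{x}/(\sqrt{a} + b\sqrt{x}),
% which indeed belongs to (0, 1/b).  At this point,
\[
  \frac{a\lambda_*}{1 - b\lambda_*} = \sqrt{ax},
  \qquad
  \frac{x}{\lambda_*} = \sqrt{ax} + bx,
  \qquad\text{so that}\qquad
  \frac{a\lambda_*}{1 - b\lambda_*} + \frac{x}{\lambda_*} = 2\sqrt{ax} + bx.
\]
% Convexity of the map \lambda \mapsto a\lambda/(1 - b\lambda) + x/\lambda on (0, 1/b)
% shows that \lambda_* is indeed the minimizer.
```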
For any $w > 0$, they entail embeddings between the corresponding deviation events, where $\lambda_w$ achieves the infimum. This leads to the standard Bernstein inequality for $U_t$. By choosing $w = c_0 (x+1)/n$ for some constant $c_0 > 0$, this gives an inequality which says that when the variance term $\vartheta_t$ is small, the sub-exponential term is dominating in Bernstein's inequality. For any $0 < v < w < +\infty$, combining this with (22) and (23), we obtain the deviation (25). Now, we want to replace $\vartheta_t$ by the observable $\hat\vartheta_t$ in the deviation (25). Note that the process $\tilde U_t$ (a suitable martingale transform of the $M^i$, with integrands again bounded by 1) is a martingale, so, following the same steps as for $U_t$, we obtain that $\exp(\lambda \tilde U_t - \tilde S_\lambda(t))$ is a supermartingale, with $\tilde S_\lambda$ the corresponding compensator. Now, writing again (23) for $\tilde U_t$, with the fact that $|H^i_t| \leq 1$, and using the same arguments as before, we arrive at a deviation bound relating $\vartheta_t$ and $\hat\vartheta_t$. On the corresponding event, a bound on $\vartheta_t$ in terms of $\hat\vartheta_t$ follows, simply by using the fact that $A \leq b + \sqrt{a A}$ entails $A \leq a + 2b$ for any $a, A, b > 0$. This proves that $\vartheta_t$ can be replaced by $\hat\vartheta_t$, so, using (25) and (26), we obtain an inequality similar to (25), in which $\vartheta_t$ is replaced by the observable $\hat\vartheta_t$ in the sub-Gaussian term. It remains to remove the event $\{ v \leq \vartheta_t < w \}$ from this inequality. First, recall that (24) holds, so we can work on the event $\{ \vartheta_t > c_0 (x+1)/n \}$ from now on. We use a peeling argument: define, for $j \geq 0$, an increasing sequence of levels $v_j$, and use the corresponding decomposition of this event into disjoint sets. Summing the resulting deviation bounds over $j$ gives the statement of the theorem, where we introduced the constants $c_{1,\epsilon} = 2 \sqrt{1 + \epsilon}$ and $c_{2,\epsilon} = 2 \sqrt{(1 + \epsilon)(4/3 + \epsilon)} + 1/3$.

Some notations and preliminary results for the proof of the oracle inequalities
Let us introduce the following notations. We will use the notation $\langle \cdot, \cdot \rangle_n$ for the "empirical" inner product between two functions $h, h' : \mathbb{R}^d \to \mathbb{R}_+$ (two "covariate" functions), and $\| \cdot \|_n$ for the corresponding empirical norm. To avoid any possible confusion, we will always write $\beta^\top \beta'$ for the Euclidean inner product between two vectors $\beta$ and $\beta'$ in $\mathbb{R}^M$. In view of (11), and of (5) and (3), the empirical risk can be expanded in terms of these quantities; using Lemma 1 twice, we obtain the identities (28) and (29) used in the proofs below.

Proof of Theorem 1
Recall that the empirical risk $R_n$ is given by (10). As a consequence of (28) and (29), we obtain a decomposition of the empirical risk, valid for any $\beta \in \mathbb{R}^M$. By definition of $\hat\beta$, we have $R_n(\hat\beta) + \mathrm{pen}(\hat\beta) \leq R_n(\beta) + \mathrm{pen}(\beta)$ for any $\beta \in \mathbb{R}^M$. Let us introduce the event $\mathcal A$ on which the noise term associated with each $h_j$ is bounded by the corresponding weight, where the weights $\hat w_j$ are given by (14). Using Theorem 3 together with a union bound, we can bound from below the probability of $\mathcal A$, where $c_3$ is a purely numerical positive constant from Theorem 3. On $\mathcal A$, the noise term can be controlled uniformly over the dictionary, for $B$ an arbitrary convex set. This proof uses arguments from [17]. We denote by $\partial \phi$ the subdifferential mapping of a convex function $\phi$. The function $b \mapsto R_n(b)$ is differentiable, so the subdifferential of $R_n(\cdot) + 2\,\mathrm{pen}(\cdot)$ at a point $b \in \mathbb{R}^M$ is given by $\partial (R_n + 2\,\mathrm{pen})(b) = \{ \nabla R_n(b) \} + 2\, \partial\, \mathrm{pen}(b) = \{ 2 H_n b - 2 h_n \} + 2\, \partial\, \mathrm{pen}(b)$.
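For completeness, the subdifferential of the weighted $\ell_1$ penalty entering this expression has the usual explicit description from convex analysis; written out for $\mathrm{pen}(b) = \sum_j \hat w_j |b_j|$ as above and $B = \mathbb{R}^M$, it yields the first-order optimality condition in explicit form.

```latex
% Subdifferential of the weighted l1 penalty pen(b) = sum_j \hat w_j |b_j| at b in R^M:
\[
  \partial\,\mathrm{pen}(b)
  = \Big\{ \theta \in \mathbb{R}^M : \ \theta_j = \hat w_j\,\mathrm{sign}(b_j) \text{ if } b_j \neq 0,
    \quad |\theta_j| \le \hat w_j \text{ if } b_j = 0 \Big\}.
\]
% With B = R^M, a point b minimizes R_n + 2 pen if and only if
% 0 \in \nabla R_n(b) + 2 \partial pen(b), i.e. 2 H_n b - 2 h_n + 2\theta = 0
% for some \theta in the set above.
```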