Structured estimation for the nonparametric Cox model

: In this paper, we study theoretical properties of the non-para- metric Cox proportional hazards model in a high dimensional non-asymp-totic setting. We establish the ﬁnite sample oracle l 2 bounds for a general class of group penalties that allow possible hierarchical and overlapping structures. We approximate the log partial likelihood with a quadratic functional and use truncation arguments to reduce the error. Unlike the existing literature, we exemplify diﬀerences between bounded and possibly unbounded non-parametric covariate eﬀects. In particular, we show that bounded eﬀects can lead to prediction bounds similar to the simple linear models, whereas unbounded eﬀects can lead to larger prediction bounds. In both situations we do not assume that the true parameter is neces- sarily sparse. Lastly, we present new theoretical results for hierarchical and smoothed estimation in the non-parametric Cox model. We provide two ex- amples of the proposed general framework: a Cox model with interactions and an ANOVA type Cox model.


Introduction
Prediction of an instantaneous rate of occurrence of events when covariates are high dimensional plays a critical role in contemporary genetics studies underlying the causes of many incurable diseases. The challenge of high-dimensionality and arrival of high throughput bioinformatics data give rise to the surge of interest in the statistical literature.
Ever since Cox's seminal work on proportional hazards models [9, 10] many significant steps have been taken toward analyzing and quantifying regression estimators with censored data -notably the work of [33,1] and [35], among others. [2]'s seminal work established asymptotic properties of a class of estimators maximizing partial likelihood. It also introduced a martingale decomposition of the score vector of Cox's partial likelihood. Such martingale techniques were then further developed for a class of truncated regression models [18], additive risk models [22] and competing risks models [25]. Despite the substantial body of existing work on proportional hazards estimators, research on high dimensional proportional hazards estimators has mostly been limited to completely specified models and exactly sparse estimators [5,15]. Several recent papers have shed new light on high dimensional, but not-necessarily, sparse estimators [14,21,17], by presenting a finite sample framework for the statistical analysis when p n. The partial likelihoods studied in those papers are assumed to be finite sample versions of global convex and quadratic functions. When p n, many instances of such likelihoods will only possess quadratic curvature over small, local and dimensionality independent regions. In this paper, we present new theory that does not rely on such an assumption and we exemplify differences between the two regimes, strongly quadratic curvature and local quadratic curvature. Broadly, we are interested in allowing model-misspecification and in allowing covariates whose dimension grows together with p and n.
We consider the following nonparametric Cox model. Conditional on the p-dimensional covariate x, the hazard function is modeled as for a baseline hazard function λ 0 (t) and the relative risk function g(x). In order to estimate g when the dimensionality of the covariates p is much greater than the number of samples n, it is commonly assumed that function g(x) exhibits some form of sparsity. In the Cox model g(x) = β * T x for an exactly sparse vector. However, the linear modeling assumption on the log hazard ratio does not admit a good interpretation when g(x) is nonlinear in nature.
To handle this challenge, we divide the function g(x) into an additive and nonadditive component, where the additive component can be well approximated through a sparse structured additive function. We aim to solve the following problem min where B is a convex set, f β is a parametric approximation of g. To the best of our knowledge, previous literature on the proportional hazards regression all focused on bounded sets B. On the contrary, we consider a convex set B that has diameter dependent on the dimensionality p, which grows exponentially with the sample size n. In practice, the relative risk function g is unknown and it is impossible to perfectly solve (2). Our goal is therefore to recover an approximate solution to this problem. That is, we wish to construct an estimatorβ such that with a constant K ≥ 1 that does not depend on p and is as small as possible. An inequality that provides an upper bound on the (random) quantity in the above display, in a certain probabilistic sense, is referred as an oracle inequality later on. Observe that this is not an additive model since we do not assume that the risk function g is of the form f β for some β ∈ B. Consequently, the bias term min β∈B f β − g 2 may not vanish and the goal is to approximate the structured combination with the smallest possible bias.
There is a vast literature on establishing oracle inequalities for (3). Optimal rates of estimation for Gaussian linear regression can be found in [36]. Approximate sparse models, can be handled for both the linear and generalized linear case [38]. Lack of unique sparse modeling of the relative risk in our setting, poses unique theoretical challenges. Both the gradient and the Hessian of the partial likelihood of model (1) are indexed by β. As the true relative risk does not have parametric form, both the score and the Hessian lose its martingale structure. Such problems are similar in nature to model misspecification. When the baseline hazard is known, the covariates are fixed and bounded; [21] developed bounds using the additive Lasso penalty. Existing literature on the theoretical analysis of the proportional hazards model (1), does not address model misspecification that can be structured in nature, or that can allow random designs of size that depend on p. We approach this problem by introducing a very general class of penalty functions.
Existing literature that establishes oracle inequalities for (3), typically assumes that the loss function satisfies a quadratic margin behavior (see Assumption B of [38] or Condition 2 of [32]). Such an assumption is related to the strong convexity of the likelihood function. Using the same assumption, [17] showed oracle inequalities using Kullback-Leibler loss for the additive Cox model. In context of additive hazards models, [14] designed a loss function that satisfies quadratic margin and use fixed, bounded design to establish oracle inequalities. However, in proportional hazards models (1), when the diameter of the set B grows with p, such quadratic margin assumption can be easily violated. Moreover, new theoretical challenges arise when departing from the quadratic margin; the flat geometry of the loss function prevents the access to the informative oracle bounds. In the context of linear models, [31] developed restricted strong convexity arguments. For the nonparametric Cox model (1), we develop concentration arguments, to show that the geometry of the loss does play a significant role only in small, ellipsoidal neighborhoods of the sparse, non-degenerate models. Within such neighborhoods, we sandwich the loss function with two quadratic losses and analyze the two losses independently.
When p n, it has been established [20,7,24] that vectors that maximize the likelihood over either a finite, sparse set or its convex hull can achieve oracle inequalities (3). Censored high dimensional data are often collected from clinical studies where genomic formations are highly complex with a large number of possible interactions. Despite its importance, the structured sparsity is rarely studied in the context of censored observations. [40] gave an interesting empirical study in the cases of p ≤ n. This article focuses on censored data with structured (group or hierarchical) sparsity. In particular, our results easily extend to situations where group LASSO, hierarchical LASSO, group Ridge, Elastic Net and block l 1 /l ∞ penalty are employed.
The contributions of our paper are three fold. First, we establish two new oracle inequalities (OI) for the high-dimensional nonparametric Cox model (1) that explicitly bound the squared estimation error under a random design. Second, we develop techniques that allow deviations from the exact sparsity and that introduce model misspecification. Third, we show new bounds for hierarchical and smooth selection in the context of additive Cox models. In particular we discuss the complete CAP family as introduced in [43] and penalties based both on sparsity and smoothness constraints, the elastic net penalty [45], for example.
The rest of the paper is organized as follows. In Section 2, we define a new class of group penalty functions. Section 3 contains our theoretical results. New bounds on the distance between the least squares and the partial likelihood loss function are presented in Section 4. Section 5 is left for two examples of the Hierarchical Lasso and the Elastic Net penalties used in the Cox model with interactions and an ANOVA Cox model, respectively. We use the following notation. A pd dimensional vector x is represented as The Höelder conjugate of γ j is denoted by γ * j and satisfies 1/γ j + 1/γ * j = 1. The Euclidean functional norm, · 2 , is defined as f (X) 2 = 1 n n i=1 f 2 (X i ). Throughout the paper, we use ⊗ to denote the outer product between vectors, that is x ⊗2 denotes xx T , for any vector x.

Convex group selection
We begin by setting up the notation behind point process models. Let T denote the event time, let D denote the censoring time, and X = (X T 1 , . . . , X T p ) T de-note the pd-dimensional covariate vector with X j = (X j1 , . . . , X jd ). Define Z = min(T, D) and δ = 1{T ≤ D} as the observed event time and censoring indicator, respectively. We consider an i.i.d sample {(X i , Z i , δ i ) : i = 1, . . . , n} from the population (X, Z, δ), where X i = (X T i1 , . . . , X T ip ) T and X ij = (X ij,1 , . . . , X ij,d ) T . Let the event time, T , and the censoring time, D, be independent conditional on the covariates. We denote with t 1 < · · · < t N the ordered failure times and with R q = {i ∈ {1, . . . , n} : Z i ≥ t q } at risk set at each failure time t q . We define counting processes where g(x) is the unknown function of interest. Moreover, we use Λ 0 (τ ) = τ 0 λ 0 (t)dt to denote the integrated baseline function.
To approximate g(X), we define two collections of functions, the first includes univariate functions {f 1 (x), . . . , f p (x)} whereas the second includes a collection of candidate dictionary functions {Ψ 1 (x), . . . , Ψ d (x)}. Examples of dictionary functions include wavelets, splines, step functions, frames etc. We aim to approximate g(X) with a linear combination of univariate functions f j , each of which we approximate with a linear combination of dictionary functions Ψ k (x). Specifically, we approximate g(X) with where similarly as before we used Ψ(X i ) = Ψ(X i1 ) T , . . . , Ψ(X ip ) T T with Ψ(X ij ) = (Ψ 1 (X ij ), . . . , Ψ d (X ij )) T . The candidate functions are known a priori with |Ψ k (x)| ≤ C < ∞, but need not be orthogonal. Note that we do not make assumptions on the number of candidate functions d or p and we allow both to grow and be much larger than n.
Let τ denote the end of the study time, we define the empirical risk function R n (b) = −L n (b, τ ) and L n (b, τ ) denotes the log partial likelihood associated to the additive component using the counting process notation: with We denote population equivalents of S (l) to denote the censored empirical average of the unknown hazard function. Later on we denote L n (b, τ ) as L n (b) for simplicity.
We fix some vector β * ∈ B such that β * = (β * 1 , β * 2 , . . . , β * p ) T , β * j ∈ R d and are such that β * j = 0, for j ∈ M * , β * j γj = 0, j ∈ M c * . The set M * is any subset of {1, . . . , p} that has at most s elements, i.e., |M * | ≤ s. Such a vector β * possesses structured or grouped sparsity. We choose vector β * among all structured-sparse vectors, such that f β * is the closest, in the Euclidean distance, from the unknown function g, i.e. such that f β * − g 2 = min b f b − g 2 . Notice that f β * − g 2 = 0 if and only if f β * = g, i.e. if the hazard risk has additive structure. In general the bias term min β∈B f β − g 2 does not vanish, and our goal is to imitate the structured vector β * . The Cox regression model is a very flexible tool when analyzing the effect of several "risk" factors on time to event problems. Very frequently a "risk" factor may have several levels and can be expressed via a number of dummy variables [26]. The dummy variables corresponding to the same factor form a natural group that we would like to preserve at estimation. If a parameter of one of the factor levels enters a model, we would like to encourage other associated factors to enter the model. Additionally, it is of interest to determine the role of the interactions between various "risk" factors on the outcome of the patients with familiar event of interest [19]. Such studies require hierarchical models with multiple interaction terms. If a parameter of one of the interactions enters the model, then it does not force the main effects to be present in the model. Hierachical penalties have a more suitable format that forces main effects to be present in the model if the interactions are.
With this in mind, we consider a class of estimators β that solve the following penalized problem where we define the group penalty function (GPF) as with a convex function ρ. We consider convex functions ρ that are sparsity encouraging and that satisfy ρ (0+) > 0. The scaling, d 1/γ * j , ensures that the penalty term and the number of parameters within each group are of the same order. The GPF includes a wide variety of grouping and hierarchical structures: ρ determines how groups relate to one another, while {l γj } p j=1 norms dictate the relationship of the coefficients among each group, j. For ρ = l 1 and any γ j , the penalty function reduces to the CAP family of [43]; for ρ = l 1 and γ j = 2, it becomes the group Lasso penalty of [42]; for ρ = l 1 and γ j = ∞ it reduces to the block l 1 /l ∞ penalty of [30]. The problem can be reparametrized to include a variety of scaling factors in the penalty function. For example, with proper weights R j , or [27]. In section 5, we discuss these cases in details.
The following property of the introduced GPF is important in establishing finite sample bounds. We leave the proof to the Appendix.
Then, if all the events E n,j hold with j = 1, . . . , p, we have that the GPF family (7) with convex functions ρ satisfies

Main results
In this section, we present the main results and establish the non-asymptotic oracle inequalities of β in terms of the l 2 prediction error. Our results differ from the previous literature in terms of the penalty function and the measure of prediction error. We present non-asymptotic prediction properties that allow the number of covariates to diverge with n while allowing complicated group structures in the model. Most of existing theoretical derivations in literature are based on the assumption of bounded covariates. More precisely, we define a constant M p such that We note that M p is often bounded random quantity, especially in studies where the dimension of the covariates is considered as fixed, or the function f b is bounded. In high dimensional settings where p ≥ n, M p could diverge with p and n and should be carefully considered. For example in cases where B is a p-dimensional ball of diameter r and X i s are i.i.d. standard Gaussian, then log M p = r 2 log p/n and is unbounded for all r such that n 1/4 = o(r). Moreover, most of finite sample studies rest on a fixed design setup, a condition rarely satisfied in large genomics studies in the presence of censoring. Most of this paper is dedicated to develop theory that allows M p in equation (9) to diverge with p in a random design setting. We present two finite sample results, where the first is rested on assuming bounding M p (Theorem 1) whereas, the second isn't requiring such a condition (Theorem 2).
To present the results we define .
The gradient and the Hessian of the log partial likelihood are of the form: Remark 1. We note that L n (β * ) can be decomposed as In the Cox models where there is a unique true parameter β * , the last term in the above display is zero. In our case, however, that term does not disappear as the compensator does not vanish.
The following condition replaces the conditions used in the asymptotic analysis of the estimation properties of the Cox model in [11], where p < n, and those presented in Condition 2 of [5], where p > n. Condition 1. The random variables X 1 , . . . , X n are independent, identically distributed random variables such that (i) the nonparametric function of interest satisfies E exp{g(X i )} < ∞, (ii) the process Y (t) is left continuous with right hand limits and such that D := P (Y (τ ) = 1) > 0 and Λ 0 (τ ) < ∞.
Before we state the main oracle inequality, we provide concentration of measure for the gradient of the log partial likelihood at the sparse vector β * . To that end, we need a preliminary result providing concentration of measure for the vector E n (β * , t).This is summarized in the following Lemma.
Lemma 2. If Condition 1 is satisfied, then there exists a constant W > 0 independent of p, n, and d, such that for every sequence of positive numbers r n , for log u = β * 1 and c = 1 + 2 exp{m * C − log D + C log u}, with m * being the minimal signal strength defined as m * = min{ β * j γj : j ∈ M * }. Recall that, functions Ψ k are such that |Ψ k (x)| ≤ C, for a positive constant C < ∞.

Remark 2.
For fixed dimension p, it is typically assumed in the literature (see [11]) that there exist population functions s (l) (β, t), l = 0, 1, 2 such that when n → ∞. For growing dimension p, [5], take on a relation of the previous condition and only assume the convergence to hold for the true parameter β * . Since we deal with diverging p, s, and aim to provide finite sample bounds, conditions of [5] are unsatisfactory in our case. Lemma 2 proves an exact rate of convergence without imposing any of the asymptotic conditions like (10).
The next result gives tail probabilities that will be used to control the approximation error. They both depend on the GPF and require nontrivial proofs. Our theoretical derivations are further complicated due to the lack of martingale structure in the score vector L n (β * ). We have the following result: and a sequence of positive numbers λ n and all j = 1, . . . , p, for a truncation value y such that Moreover, there exists a constant W > 0 independent of p, n, and d, such that for C λn,n,p,d defined as For clarity of exposition, the proof is relegated to the Appendix B.
Remark 3. The result of (11) shows that the penalty term absorbs the misspecification term that emerges due to lack of a unique model. Additionally, the result (12) controls the martingale part of the score vector L n (β * ). The proof is challenged by the lack of typical assumption (10) and M p < C 0 , with M p defined in equation (9), that is used to control the jump size and variation of the martingale. Instead, we develop finite sample tail bounds for both the jump sizes and predictable variation of the martingale h n (β * ).
Let A denote a pd × pd matrix. To establish the sparse oracle inequality, we define the Restricted Eigenvalue constant RE(µ, s, ρ, γ, A) as follows: where for M * = {j ∈ {1, . . . , p} : β * j γj = 0}, |{M * }| ≤ s. The set C µ,ρ consists of all vectors that have support similar to the sparse vector, β * . In particular, vectors with more than s non-zero elements also belong to the cone C µ,ρ . We only require that their components positioned outside of M * are smaller in size than their components positioned inside M * . For [24]. Its geometry changes with the penalty function, ρ, and the chosen γ j s.
Thus, we use the notation RE(µ, s, ρ, γ, A) to describe its dependence on the sparsity size, s, the choice γ = (γ 1 , . . . , γ p ) T , the vector of norms used to describe the "smoothness" of each f j and the choice of the matrix A. For the linear models, A takes the form of X T X for fixed designs [4] and Σ for random Gaussian designs with covariance matrix Σ [8]. For the parametric Cox model where g(X i ) = β * T X i , [15] consider the case of A = − 2 L n (β * ) = n −1 n i=1 τ 0 V n (β * , t)dM i (t) for both fixed and random designs. We postpone further discussion of this constant to the Appendix A.
Let us introduce two constants 0 ≤ υ 1 ≤ 1 and 0 ≤ υ 2 ≤ 1 satisfying These constants are used to bound the size of the GPF functions.
Lemma 4. Letβ be defined as in (6) with penalty function GPF defined in (7). Let Condition 1 hold and let ζ = ζ(s) be a positive constant. Then, with probability 1 − δ, for δ in (19) and all b ∈ B, We also define a sequence of weight vectors ω(b) = (ω 1 (b), . . . , ω n (b)) T as follows, With these preparations we are ready to state the main result for the case of bounded covariate effects. Let M p be defined in (9) and let E 0 be the event such that M p < C 0 , for a constant C 0 independent of p or n. Theorem 1. Letβ be defined as in (6) and penalty function P (b) defined in (7). Let ζ = ζ(s) be a positive constant. Then, for any non-negative constant A > 0 and log u = β * 1 , with for θ, y, M, u, m * as in Lemma 3, andd = j∈M * d 2/γ * j , 0 ≤ υ 1 ≤ 1 and 0 ≤ υ 2 ≤ 1 satisfying (16) and Proof of Theorem 1. This proof requires careful analysis of the possible model misspecification and uses results from Propositions 2 and 3 stated in Section 4. To that end, we define an empirical functional norm · n,b * for functions for nonnegative weight process . This norm is connected to the curvature of the partial likelihood and is further discussed in Section 4. From the Taylor expansion and some algebra, we have that the following representation holds for all b: where in the last expression we used the Doob Mayer decomposition From the definition of the penalized estimator as the minimizer of penalized empirical risk in (6), we obtain −L n ( β) According to (22) we decompose ν n (b,β, g) in two parts, one that can be tied up with the estimation error and another that can be tied up with the penalty term. To that end, we observe that n (b, t) at the true, unknown function g(x) and with λ 0 (τ ) denoting the value of the baseline hazard function at the end of the study time τ . Observe that log S (0) n (b, t) is a positively weighted log-sum-exp function for any value of b, therefore it is Lipschitz continuous (with constant 1 with respect to the l ∞ norm), Furthermore, utilizing Condition 1 we obtain Combining with the previous result, we get for any b and b * , bβ fixed and defined as before. To establish oracle inequality we need to tightly control the last three terms on the right hand side of the previous inequality. The first of those is a martingale score vector at the additive part, the second measures model misspecification, whereas the third quantifies the size of the penalty function. Model misspecification is controlled by the penalty term. To that end we use the result of Lemma 1. Utilizing Lemma 1 with ∆ = β − b and v n = 2h n (β * ), the following holds from the first equality in (8) on the event E n defined as Moreover, utilizing Lemma 1 again, but now with ∆ = β − b and v n = 4γ n Ψ(X i ), the following holds from the second equality in (8) on the event D n,i defined as After rearranging the terms and noticing that |∆ and with it that 4γ n max 1≤i≤n |∆ T Ψ(X i )| ≤ λ n P (∆), on the event D n,i .
Recall that E 0 is the event that M P < C 0 for a random element M p defined in (9) and a constant C 0 independent of p. Therefore, we can combine (24) with (25) and (28), to obtain that for all b, conditionally on the event the following inequality holds Secondly, we control the penalty term in (29) in Lemma 4 whose proof is presented in Appendix D.
Utilizing further the bound between the norms (proved in Proposition 3 in Section 4) in combination to (29), we obtain with υ 1 , υ 2 defined above in (16). Moreover, from the definition of the vector β * and the triangular inequality we have which in combination to the previous inequality provides The theorem follows easily if we bound the probability of the event E 0 ∩ E n ∩ n i=1 D n,i , which is given in Lemma 3. Hence, the proof is completed.

Remark 4.
A typical assumption in the literature of oracle inequalities of the type (18) requires profile likelihood to be bounded below with a quadratic function (see quadratic margin condition of Assumption B of [38]). Instead of assuming such a margin, we directly derive lower and upper quadratic processes that sandwich the partial likelihood process. In case of Gaussian linear models, these two quadratic processes equal to the typical l 2 loss function.
Theorem 1 establishes new finite sample oracle inequality with possible deviations from exact sparsity. The first term on the right hand side of (18) measures how far is the true function of interest g(x) from the sparse additive approximation f β * and is only equal to zero if g = f β * almost surely. Typically, similar results appeared in problems with fixed design [17] or if one considers estimation errors related to the Kullback-Leibler divergence that are quadratic in nature [14,21]. In contrast, our results hold for a log partial likelihood of non-quadratic type and a class of general random designs and general group penalties. The last two quantities of the RHS of (18) represent the convergence rate for the appropriate choices of λ n .
We now comment on the size of the constant ω −1 appearing in the bound (18). Each weight, ω i (b), is a sum of the conditional probabilities that observation i had an event at time t q , given that at least one event occurred at time t q . Proposition 1. Let η > 0 and c 2 ∈ R be constants such that for all q = 1, . . . , N, where I pd is a unit matrix. Then, for all i ∈ {1, . . . , n} and b n > 0, the solution to the optimization problem is attained and the minimum ω min satisfies The conditions of Proposition 1 are not restrictive and are easily verifiable for well posed problems. For κ i = min{vΨ ⊗ (X i )v T , v 2 ≤ 1, v ∈ C 7,ρ } and by Cauchy's interlacing theorem of Hermitian matrices, for Propositon 1 to hold it suffices that the random covariates X i satisfy min i∈Rq κ i > 0. In that case, we conclude that ω satisfies δ ω ≥ For κ q = min{ i∈Rq vΨ ⊗ (X i )v T , v 2 ≤ 1, v ∈ C 7,ρ } we obtain an upper bound on the leading constant of Theorem 1, with the right-hand side bounded away from infinity almost surely.
Remark 5. Note that if P (M p < C 0 ) → 1, the proposed estimator achieves Gaussian-like oracle rates similar to those of penalized least squares methods used in Gaussian linear models (see further discussion in Section 5). In other words, the first term in (18) is negligible and the rate is driven by the last two terms.
In the next result, we take a novel approach and establish high-dimensional sparse oracle results for possible unbounded covariate effect, i.e. without requiring Condition 9.
Remark 6. The proof of the rest of the section consists of two parts. First, we localize our penalized estimator to a small elliptical neighborhood around β * . With an appropriate choice of the tuning parameter, λ n , we show that the diameter of the convex neighborhood becomes independent of the dimensionality and it shrinks to zero asymptotically (see Lemma 5). Second, using such local neighborhood structure, we sandwich the partial likelihood process with a lower and upper quadratic processes as in Theorem 1. The proof is completed by the analysis of those lower and upper bounds.
The localization step is presented in the following Lemma 5.
Next, we present the main result of this section.
Theorem 2. For log(pd) ≤ n letβ be defined as in (6) and penalty function P (b) defined in (7). Let Condition 1 hold and let ζ = ζ(s) be a positive constant. Then, for non-negative constant A > 0 and u defined in Theorem 1, with probability no less than 1 − δ, δ > 0 and satisfying (19), there exists ε > 1 such that, Proof. We proceed by first restricting the parameter space to an elliptical neighborhood that is not expanding with dimensionality p. Then, we apply Lemma 4 (stated in the Proof of Theorem 1) and Proposition 3 (stated and proved in Section 4) to finalize the proof. From Proposition 2, we have that The exponential term in the previous equation needs to be tightly controlled for p n. The proof of the theorem is then finalized by finding nontrivial bounds for the empirical norms, · n,· as defined in (20), while allowing p n. Let p ≥ n and log p ≤ n. We establish that by bounding the appropriate norm of the error vector β − β * and obtaining the bound, which is log linear in dimensionality p. The result is summarized in Lemma 5, whose proof is given in Appendix D.
Consequently we have Remember that C ≥ max k,i,j |Ψ k (X ij )|. Hence, we have successfully localized the error vector β − β * in a convex neighborhood whose diameter is not increasing with the dimensionality p. Utilizing further Lemma 4 and Proposition 3 with equation (29), we obtain with υ 1 , υ 2 defined above in Lemma 4. The proof is completed by applying the triangle inequality.
Let us comment on the size of ε appearing on the RHS in Theorem 2. If s ≤ log n, with the help of Proposition 1, we can conclude that ε ≥ 0 and where r n → 0 and is such that r n ζ 2 = λ n j∈M * d 2/γ * j with the constant C defined as the upper bound on the dictionary functions Ψ.
Remark 7. The difference in the rates of convergence between Theorems 1 and 2 reflects the dimensionality and geometry of the problem. In comparison to Theorem 1, Theorem 2 differs in the presence of the exponential term of the order of e rn . This additional term is coming from the complex likelihood structure for possibly unbounded covariate effects for which we establish that local Lipchitz constant is proportional to e rn .
Results of (34) and (35) imply dimensionality restrictions on the hazard regression problem (6). Results of Theorem 2, rely on the choice of the number of basis functions, d, and it requires d −1 log(pd) < √ ne − β * 1 /s. This on the other hand, implies certain bound on s, p, n and d. In the case of d = O(n 1/2 ) we have s log p n e β * 1 + s log n 2n e β * 1 < 1, which is more restrictive than the bound appearing in linear regression models with s log p/n < 1 [4]. We also note that previous results do not require exact sparsity to hold, that is, they do not assume β * is the true underlying parameter. In summary, the results of Theorems 1 and 2 are quite general. They cover a wide range of penalty functions with a choice of γ j 's and are applicable to Lasso, group Lasso, group ridge, CAP penalty, elastic net and many more. Two specific examples will be discussed in Section 5.

Sandwich bounds for the log partial likelihood
This section gathers some results that were crucial in obtaining Theorems 1 and 2. The novel ideas of the major results are to quantify the distance between the log partial likelihood −L n (b) and the approximate quadratic expansion of the log partial likelihood.
Without loss of generality, −L n (b) can be written as R n (b) = −L n (b) + L n (β * ) − L n (β * ). By Taylor expansion around β * , we have that there exists a c ∈ (0, 1) and b * = cb + (1 − c)β * such that Together with the previous Taylor expansion, the empirical risk function can be decomposed as follows. For every b, there exists a c ∈ (0, 1) and b * = cb + (1 − c)β * such that R n (b) admits the following quadratic representation: Because no two counting processes, N i (t) and N j (t), jump at the same time, the following holds: can be understood as a process of empirical weighted averages of f b . If Condition 1 is satisfied, then, there exists a c ∈ (0, 1) such that the introduced empirical norm is a proper norm. To be specific, the norm is nonnegative, f b 2 n,b * = 0 for every such b * if and only if b = 0. In addition, the norm satisfies the triangular inequality that Since the squared Euclidean norm · represents a natural benchmark, we seek to understand the lower and upper bounds of the · n,· norm using the l 2 empirical norm · in the next result.

Proposition 2.
Let ω be defined as in Theorem 2. For any vector v define Then, the following sandwich bound holds almost surely for every vector b and corresponding vector b * = cb + (1 − c)β * , uniformly for every c ∈ (0, 1).
A similar result appeared independently in the recent work of [15] (see Lemma 4.3). Such a result shares similarities to self-concordant arguments of [3], but the last arguments do not cover cases of p n.

Proposition 3.
Let N represent the number of distinct events. Then, uniformly for every b ∈ [−b n , b n ], with b n > 0 satisfying the condition of Proposition 1, and b * = cb + (1 − c)β * , with c ∈ (0, 1), the following holds almost surely: (39) Moreover, if b n is bounded and min 1≤i≤n λ min Ψ(X i )Ψ T (X i ) > 0, then the left-hand bound in (6) is strictly positive almost surely.
Propositions 1-3 are critical in establishing the main result in terms of nontrivial lower and upper bounds. We utilize Propositions 1 & 3 for low-and 2 for high-dimensional problems, respectively. With the help of all four results, we are able to obtain the main results in Section 3.

Examples
In this section, we present two examples of GPFs (7) (that allows hierarchical structures within and among groups) and establish their theoretical properties in the Cox model setup. To the best of our knowledge, similar results do not exist in the current literature. For simplicity in the presentation, the results of this section focus on the exact sparse models with β * representing the unknown true parameter.

Hierarchical selection and CAP
Our results apply to a general class of additive models, where the groups in the additive Cox model may share some, but not necessarily all features across groups. For example, the effect of one gene can be shared by many different pathways, and thus studying hierarchical gene selection is of significant importance.
One model that has very specific hierarchical structure is a Cox model with pairwise interaction terms. Given the covariates, the hazard rate function λ(t|X) of each patient is related to the covariates X 1 , . . . , X p and their cross products X j X k with the following model where Θ = Θ T ∈ R p×p . In the display above, we refer to the additive part as the main effect terms and the quadratic part as the "interaction" terms. In this model, the total number of parameters is 1/2(p 2 + 3p) and each f j takes on a bilinear form β j X ij + p k=1 Θ jk X ij X ik with b j = (β j , Θ T j ) and Ψ(X ij ) = (X ij , X ij X i ).
More generally, going back to the notation of previous chapters, let all covariates be decomposed into G possibly overlapping groups {Γ j } G j=1 in such a way that j Γ j = {1, . . . , p} and Γ j ∩ Γ k = ∅, for j = k. Then, each f j can be approximated by b T Γj Ψ Γj , where Γ j is a set of covariates that belongs to group j. In previous example the set Γ j was {j, 1, . . . , p}.
The regularized estimator, β, is then defined as the minimizer of penalized partial likelihood PL n (b) + P (b), where with the penalty function P (b) defined as where |Γ j | denotes for the cardinality of that set. Note that this penalty includes the classical group Lasso penalty, where one would select all γ j = 2.
Corollary 1. Let conditions of Theorem 2 be satisfied. Then, for some constant A > 4 and the choice of the tuning parameters we have for r n = ζ −2 j∈M * λ n,j |Γ j | 2/γ * j , and 0 ≤ υ 1 ≤ 1 satisfying The proof of this result is omitted because it is a simple modification of that of Theorem 2 with λ n being adaptive to each group Γ j . The oracle inequality of Corollary 1 discusses finite-sample properties of the whole CAP family proposed in the seminal work of [43]. In particular, the block l 1 /l ∞ penalty introduced in [30] is a member of the CAP family. [30] present l ∞ bounds on the estimation error of block l 1 /l ∞ penalty in the linear models. Corollary 1 provides its finite sample l 2 error bounds for the sparse additive Cox model with possibly overlapping groups.
In more details, we obtain with high probability for the block l 1 /l ∞ penalty, 0 ≤ υ 1 ≤ 1 satisfying and r n = ζ −2 j∈M * λ n,j |Γ j | 2 . In case of a parametric model with g(X i ) = X T i β * , or the interaction model discussed in the example above, we observe that Gaussian random designs with Σ such that λ min (Σ) > 0 will make the last two terms in (40) negligible.
Moreover, non-overlapping groups gained significant attention with importance of multi-task learning [24]. Similar setup has not been investigated in models related to (1). Corollary 1, provides a finite sample bound for the multitask learning i.e. l 1 /l 2 penalty as well. In more details, we obtain with high probability j∈M * λ 2 n,j /ζ 2 and r n = dζ −2 j∈M * λ n,j . If in addition P (E 0 ) → 1, the upper bound above, up to a constant, matches the minimax rate of [24] developed for high dimensional linear models.

Smooth selection
Throughout the previous sections, we simplified the technical details and left out the smoothing component of the penalty. Although selection of groups of features is important, smoothing splines become of interest when considering non-parametric estimation. Because of knot selection, there are potential questions of stability of estimation. Adding pre-described smoothing requirements for the choice of Ψ has become a standard technique for avoiding instability.
Next we present one example of the ANOVA Cox model where the hazard rate function of interest is a nonparametric function of covariates. Consider the multivariate nonparametric regression problem where f is the function of estimation interest. A popular model for high dimensional problems above is a smoothing spline analysis of variance model and in particular its truncated version where r is the truncation term. The identifiability of the terms is assured by the side conditions through averaging operators [23]. The form of FGP penalty is similar to the common smoothing spline and allows multiple smoothing parameters for each function f jk independently. With the notation of previous chapters, in this model g j can be approximated using a set of basis functions {B q } d q=1 , as g j = d q=1 β jq B q (X ij ) and similarly g jk = d q=1 Θ jkq B q (X ij )B q (X ik ), with β jk and Θ jkq as the unknown parameters. In this model, the total number of unknown parameters is 1/2(p 2 + 3p)d and each f j takes on a bilinear form d q=1 β jq B q (X ij ) + d q=1 p k=1 Θ jkq B q (X ij )B q (X ik ). Hence, in notation of the previous chapters, b jk is now a vector b j = vec(β jk , (Θ j1k , . . . , Θ jpk ) T ) and Ψ k (X ij ) = vec(B k (X ij ), B k (X ij )(B k (X i1 ), . . . , B k (X ip )) T ). In the display above, vec(A) = (A 11 , . . . , A 1p , . . . , A k1 , . . . , A kp ), for a matrix A.
Let the smoothing matrix, M j ∈ R d×d , contain the inner products of the second derivatives of the B-spline basis functions, i.e., k, l = 1, . . . , d, and R j ∈ R d×d is a matrix obtained from Cholesky decomposition of M j . More generally, we show that the work of the previous sections extends to this situation with only a few adaptations. Let us define the penalized smoothed estimator as for a convex and subadditive choice of ρ. Then, we can rewrite the problem as A crucial part of extending the previous results to this novel setting requires extending the results of Lemma 1 and Propositions 2 and 3 to the new penalty structure. Details of the proof are presented in the Appendix D. To state the results we define a set Structured estimation for the nonparametric Cox model 515 for a new score vectorh n,j (β * ) = − 1 n n i=1 τ 0 ( E n,j (β * , t)−R −1 j Ψ(X ij ))dM i (t) and Let N represent the number of distinct events and let Then, on the event T n , the penalty function P Moreover, the following two statements hold almost surely: (ii) Let a v be a constant defined in Proposition 2. Then, the following sandwich bound holds almost surely for every vector b and corresponding vector b * = cb + (1 − c)β * , uniformly for every c ∈ (0, 1).
With the help of the results presented in earlier sections and this Lemma, we have the following Corollary.
Corollary 2. Let conditions of Theorem 2 be satisfied. Let M j be well defined with λ = min 1≤j≤p λ min (R j ) > 0. Then, for some constant A > 4 and the choice of the tuning parameters for r n = ζ −2 sλ n d, and 0 ≤ υ 1 ≤ 1 satisfying The result above is a finite sample one on prediction properties of a nonparametric smoothing estimator for the high-dimensional Cox model. A particular example of a smooth selection is the Elastic net penalty [45]. Although our previous results easily apply to this penalty (by specifying γ j = 1 and ρ = l 1 ), its efficient implementation in the Cox model was only recently proposed in [41], but its theoretical properties have not been previously studied. Although tackled as the last problem, the importance of the obtained finite sample bounds for smooth selection lies in the inadmissibility of such results with techniques that already exist in the literature. In particular, in the case of Elastic-Net penalty we obtain with high probability for r n = ζ −2 sλ n , and 0 ≤ υ 1 ≤ 1 satisfying υ 1 e −2Cυ1 ≤ 16λ 2 n s ζ 2 ρ (0+). Such a result holds under the assumption that the random design X is such that last two terms of the equation (41) are negligible.

Discussion
In this paper, we propose a new method for analyzing the theoretical oracle risk properties of likelihood functions that are not necessarily of a quadratic nature. By sandwiching the likelihood with two other processes, we establish that it is sufficient to analyze the risk properties of the bounding processes alone. To the best of our knowledge, minimax rates, have not been established for any survival model so far despite their importance. Equivalents of traditional information theoretic tools, such as Fano's lemma, are not easy to understand in the Cox model setup. Our proposed method of sandwiching the likelihood with two quadratic likelihoods may be useful in establishing minimax rates.

Appendix A: The restricted eigenvalue condition
The restricted eigenvalue condition, RE(µ, s, ρ, γ, A), defined in (12) represents a generalization of the cone constraint condition that appears in work on Lasso problems [4]. Equivalent definitions were proposed for various hazard rate models [21,14,17,15]. We refer to [6] for comparisons of different kinds of compatibility and restricted eigenvalue conditions and their relationships for sparse linear models. The usual scaling factor of √ n disappears in the definition of the restricted eigenvalue condition because it is included in the definition of the empirical norm, f (·) 2 n,· . Compared to the RE condition in [4], the denominators differ in that the l 2 norm is replaced with an l 1,γ norm. In the least squares procedures, 2 L n (β * ) = −X T X and the restricted eigenvalue conditions are defined on the eigenvalues of X T X. Condition (14) can be seen as a rescaling of the minimum eigenvalue problem in the classical RE condition needed for the complex likelihood structures.
Determining the class of matrices that satisfy the RE(µ, s, ρ, γ, − 2 L n (β * )) condition is an important open question. Heuristically we can argue in the following manner. First, we observe that with respect to time, τ 0 V n (0, t)dN (t) has a martingale structure. With respect to β * , it is a function of the matrix n i=1 n q=1 Ψ T (X i )Ψ(X q ). Using Condition 1 and the boundedness of the Ψ functions, matrix τ 0 V n (0, t)dN (t) will belong to a random matrix ensemble with sub-Gaussian tails, studied in [44]. Dependence through time was established not to be essential in [15], where a lower bound for RE was shown to be independent of time. Moreover, we can combine both results to conclude that for large enough sample size, there exists a positive constant ζ 1 such that with overwhelming probability min ∆∈Cµ,ρ,∆ =0 for all random designs X with bounded moments.

Appendix B: Preliminary lemmas
The following lemma provided exponential inequality for a martingale sequence and can be found in [37] as Lemma 2.1 Lemma 7. Let (Ω, F, P ) be a probability triple and let M t be a sequence of locally square integrable martingales w.r.t. the filtration F t . Suppose that |M t − M t− | ≤ K for all t > 0 and some 0 < K < ∞. Then, for each a > 0, b > 0.
where M, M t denotes predictable variation of the martingale sequence M t .
The following lemma provides an exponential inequality for a unbounded supermartingale sequence and can be found in [13] as Corollary 2.3.
Then, for any a ≥ 0, b > 0 and c > 0

Appendix C: Proofs of propositions
Proof of Proposition 1. Without loss of generality, let us represent the optimization problem (31) as a quadratically constrained minimum of the ratio of two quadratic functions of the following form Condition (30) implies that for any feasible point b, the above optimization problem is well defined. Multiplying (30) by (b T , 1) from the left and (b T , 1) T from the right results in which implies that l∈Rq exp{b T Ψ(X l )} ≥ δ( b 2 2 ) + 1 ≥ δ > 0. Let us fix an i ∈ R q for some q. Now, let us define . Then using the relation that we have that the optimal solution to (44) is equal to min{d 1 , d 2 }. By definition, d 1 is nonnegative. It remains to establish that d 2 is finite. Indeed, for every b satisfying b 2 2 ≤ r n and b T A i Proof of Proposition 2. To see that the equation (38) is correct, we adopt the following reasoning. First, note that f b − f β * 2 n,b * is equal to The upper bound follows the same reasoning, and thus it is omitted. The lower bound of the RHS of previous inequality follows by repeating the same steps as in Proposition 3 and the definition of the weight vectors, ω i (β * ), in (17), The upper bound follows directly from Proposition 3 by taking b * = β * to obtain Proof of Proposition 3. Let N denote the cardinality of the set {i = 1, . . . , n : N i (τ ) = 1}. The weight process, ω i (b, t) as defined in (21), satisfies the following normalization uniformly over b and t, For each b, there exists at least one i ∈ {1, . . . , n} such that ω i (b, t) > 0 and that for all i, for which ∃t ∈ [0, τ ], Y i (t) = 1, we have that ω i (b, t) ≤ n, for all t. Let us denote with ω i (b, t) defined as in (21). If t 1 < · · · < t N are ordered failure times and R j = {i ∈ {1, . . . , n} : Z i ≥ t j } is at risk set, then ω i (b) has the following representation: which matches the definition provided in Theorem 1 equation (17). Note that ω i ≥ 0 and ω i > 0 for i ∈ {1, . . . , n}. Using the previous notation, we have With this notation at hand, we have that Since ω i (b * ) ≥ 0, and are defined as conditional probabilities we have ω = max{ω i (b) : i ∈ {1, . . . , n}, b ∈ R pd } ≤ 1. We are then able to conclude that To obtain the left-hand side of (6), remember from previous exposition we have following the definition in (17). Hence, by centering the data so that the sample mean is equal to zero, i.e., The result of the Proposition 3 follows easily after applying Proposition 1 on the interval [−b n , b n ] and following discussion after Proposition 1.
We will prove maximal inequalities for each of the two terms in the above inequality.
First, consider classes of functions indexed by t: Since β * is a s-sparse vector we have that u = exp{ j∈M * β * j 1 }. We proceed by calculating theirs bracketing number. Noticing that previous classes are products of a class of indicator functions and a class of bounded functions we have that By direct consequence of Theorem 2.14.9 of [39] we obtain that there exists a constant W such that for every fixed k ∈ {1, . . . , d}. By replacing r with √ nr n in the first and utilizing union bound and replacing r with nr 2 n + log 2d in the second we obtain P sup P sup Second, from the definition of s (0) (β * , t) and Condition 1 (iii) we observe that there exists a constant 0 < D < 1 with D = P (Y (τ ) = 1) and with C being an upper bound on |Ψ k (x)| and m * defined as minimum signal strength in the additive component of the hazards model (1).
According to (47) and (48) we have with probability 1 2ed W 2 e −r 2 n , and according to (47) and (49) we have with probability no smaller than 1 − 1 4e W 2 e −nr 2 n . To further bound I 2 we show that |S (0) n (β * , t)| is bounded away from zero with high probability. To that end, we employ Massart's Dvoretzky-Kiefer-Wolfowitz inequality bounding how close an empirically determined distribution function is to the distribution function from which the empirical samples are drawn. Hence, Remeber that S for all t ≤ τ.
Together with (50) we have P inf With all of the above notice that with probability no smaller than 1− 1 2ed W 2 e −nr 2 n −2e −nD 2 /2 . Hence, we conclude that with probability no smaller than 1 − 3 8ed W 2 e −nr 2 n − e −nD 2 /2 .

Proof of Lemma 3.
Bounding D c n,i Recall that By simple union bound we see that First, observe that the definition of S (0) n (g, t) allows the following bound τ 0 S (0) n (g, t)dΛ 0 (t) ≤ Λ 0 (τ ) whereas the boundedness of Ψ k allows Ψ(X ij ) γ * j = ( d k=1 Ψ γ * j k (X ij )) 1/γ * j ≤ d 1/γ * j C to hold. With this in mind, we observe that Previous inequality is a tail probability of a sum of i.i.d.positive random variables where g is the unknown function of interest.By large-deviation inequality of nonnegative random variables (Lemma 8 in the Appendix B), we obtain for a sequence of non-negative numbers γ n and a truncation value y such that Eexp{2g(X i )}1{exp{2g(X i )} ≤ y}.
We will consider each term separately. First, to control υ jk 's we develop a finite sample result in Lemma 2 whose proof can be found in the Appendix D. Next, we bound |∆υ jk | and the predictable variation of the martingale υ jk . By Lemma 2, with high probability, the jumps are bounded by with w n = cr n + log d nu 2 . The predictable variation process can be bounded as follows The first term on the RHS of the above equation can be bounded above with high probability using Lemma 2 with w n . For the last term we use the result in (53) to conclude that for a sequence of non-negative numbers γ n , with probability larger than or equal to 1 − e − nγ 2 n 2θ 2 +2γn y/3 − P ( max 1≤i≤n exp{g(X i )} > y) for any truncation value y satisfying (54). Then, observe that for any three events A 1 , A 2 , A 3 , and similarly P ( Let A 1 = {|υ jk | ≥ q n },A 2 = {|∆υ jk | ≤ wn n } and A 3 = { ∆υ jk 2 ≤ τ Λ0(τ ) n √ n w 2 n γ n }. By large deviation inequality for martingales of bounded jumps and variation in Lemma 7, there exists a sequence of positive numbers q n such that P (|υ jk | ≥ q n ) ≤ 2e − nq 2 n Kqn+K 2 1 + P |∆υ jk | ≥ w n n +P ∆υ jk 2 ≥ τ Λ 0 (τ ) n √ n w 2 n γ n .
Hence, for t n = λ n d 1/γ * j ρ (0+)/4c 1 Cu we obtain Utilizing (59) and (60) we obtain a bound on the size of the set E c n as follows,