Inference under Fine-Gray competing risks model with high-dimensional covariates

Abstract: The purpose of this paper is to construct confidence intervals for the regression coefficients in the Fine-Gray model for competing risks data with random censoring, where the number of covariates can be larger than the sample size. Despite strong motivation from biomedical applications, the high-dimensional Fine-Gray model has attracted relatively little attention in the methodological and theoretical literature. We fill this gap by developing confidence intervals based on a one-step bias correction of a regularized estimator. We develop a theoretical framework for the partial likelihood, which does not have independent and identically distributed entries and therefore presents many technical challenges. We also study the approximation error from the weighting scheme under random censoring for competing risks, and establish new concentration results for time-dependent processes. In addition to the theoretical results and algorithms, we present extensive numerical experiments and an application to a study of non-cancer mortality among prostate cancer patients using the linked Medicare-SEER data.


Introduction
High-dimensional regression has attracted increasing interest in statistical analysis and has provided a useful tool for modern biomedical, ecological, astrophysical and economic data in settings where the number of parameters is greater than the number of samples (see Bühlmann and van de Geer (2011) for an overview). Regularized methods (Tibshirani, 1996; Fan and Li, 2001) provide straightforward interpretation of the resulting estimators while allowing the number of covariates to be exponentially larger than the sample size. While such methods can be consistent for prediction (i.e., estimating the underlying regression function), they do not directly yield valid confidence intervals: firm guarantees of correct variable selection can only be established under a restrictive set of assumptions, including but not limited to a minimal signal strength of the true parameter (Wasserman and Roeder, 2009; Fan and Lv, 2010; Meinshausen and Yu, 2009), which cannot be verified in practice. For practical purposes, it is of interest to develop inferential tools, most commonly confidence intervals and p-values, that do not depend on such assumptions and yet provide theoretical guarantees on the quality of estimation and/or testing; this is the goal of our work here.
For the purposes of constructing confidence intervals or testing the significance of the effects of certain covariates, relying on a naive regularized estimator alone is not appropriate; notably, construction of confidence intervals for those coefficients that have been shrunk to zero is impossible. Zhang and Zhang (2014) and van de Geer et al. (2014) proposed the one-step bias-correction estimator, which can subsequently be used to carry out proper statistical inference. Our work here was motivated by a project illustrating how information contained in patients' electronic medical records can be harvested for precision medicine. The data set linking the Surveillance, Epidemiology and End Results (SEER) Program database of the National Cancer Institute with the federal health insurance program Medicare database contained prostate cancer patients of age 65 or older. A total of 57,011 patients diagnosed between 2004 and 2009 had information available on 7 relevant clinical variables (age, PSA, Gleason score, AJCC stage, and AJCC stage T, N, M, respectively), 5 demographic variables (race, marital status, metro, registry and year of diagnosis), plus 9,321 binary insurance claim codes. Among these patients, 1,247 died due to cancer, and 5,221 had deaths unrelated to cancer by December 2013. An important goal of the project was to evaluate the impact of risk factors (clinical, demographic, and claim codes) on non-cancer versus cancer mortality, with appropriate statistical inference. Cancer and non-cancer mortality are known as competing risks in survival analysis, and cannot be handled using linear or generalized linear regression models as considered in Zhang and Zhang (2014) and van de Geer et al. (2014). Instead, we consider the proportional subdistribution hazards regression model, often known as the Fine-Gray model (Fine and Gray, 1999).
Under the classical low-dimensional setting, Fine and Gray derived estimation and inference for the model coefficients via the partial likelihood principle, and handled right censoring by inverse probability of censoring weighting (IPCW).
Considerable research effort has been devoted to developing regularized methods to handle various regression settings (Ravikumar et al., 2010; Belloni and Chernozhukov, 2011; Obozinski et al., 2011; Meinshausen and Bühlmann, 2006; Basu and Michailidis, 2015; Cho and Fryzlewicz, 2015), including those for right-censored time-to-event data (Sun et al., 2014; Bradic et al., 2011; Gaïffas and Guilloux, 2012; Johnson, 2008; Lemler, 2016; Bradic and Song, 2015; Huang et al., 2006, among others). However, regression has been little studied for the competing risks setting with random censoring and high-dimensional covariates. The purpose of this paper is twofold: 1) to study estimators under the Fine-Gray regression model for competing risks data with many more covariates than the number of events; 2) to develop statistical inference procedures in this setting. To the best of our knowledge, no published work exists on statistical inference for competing risks data that allows for high-dimensional models; univariate testing has been studied in the Cox proportional hazards model; our construction, however, allows for the testing of general linear hypotheses.
There are at least three significant challenges in addressing high-dimensional competing risks regression under the Fine-Gray model. First, the structure of the score function related to the partial likelihood causes a somewhat subtle issue: many of the unobserved factors prevent a simple martingale representation. Second, the structure, as well as the size, of the sample information matrix renders both methodology and theoretical analysis based on the Hessian matrix problematic. Third, random censoring presents non-trivial challenges in the presence of competing risks, and the weighting that is needed further complicates the theoretical analysis. Also, although the bootstrap has been considered for inference under the Fine-Gray regression model, this approach is no longer applicable given the known problems of the bootstrap in high-dimensional settings. Development of high-dimensional inferential methods for competing risks data, and under the Fine-Gray model in particular, may have been hampered by these considerations.
In this paper, we propose a natural and sensible formulation of an inferential procedure for high-dimensional competing risks regression. We first study a regularized estimator of the high-dimensional parameter of interest and derive its finite-sample properties, where the interplay between the sparsity, the ambient dimension and the sample size can be seen directly. We then propose a bias-correction procedure by formulating a new pragmatic estimator of the inverse of a large covariance matrix that allows broad dependence structures within the Fine-Gray model. We find that the bias-corrected estimator is effective at capturing strong as well as weak signals, and can be used for statistical inference. This combination leads to an efficient and simple-to-implement procedure under the Fine-Gray model with many covariates.

Model and notation
For subject i = 1, ..., n in a study, let T_i be the event time, with the event type or cause ε_i; we use the two words interchangeably in the following. Under the Fine-Gray model that we consider below, we assume without loss of generality that the event type of interest is '1', and we code all the other event types as '2' without further specification. In the presence of a potential right-censoring time C_i, the observed time is X_i = T_i ∧ C_i. We denote the event indicator as δ_i = I(T_i ≤ C_i). The type of the event ε_i is observed if the event occurs before the censoring time, i.e., when δ_i = 1. Let Z_i(·) be the vector of covariates that are possibly time-dependent. We focus on the situation where the dimension p of Z_i(·) is larger than the sample size n. Assume that the observed data {(X_i, δ_i, δ_iε_i, Z_i(·))} are independent and identically distributed (i.i.d.) for i = 1, . . . , n.
Since the cumulative incidence function (CIF) is often the quantity of interest, Fine and Gray (1999) proposed a proportional subdistribution hazards model (1.1) for the CIF, where the p-dimensional coefficient β_o is the unknown parameter of interest and h_10(t) is the baseline subdistribution hazard. Under the model (1.1), the corresponding subdistribution hazard is h_1(t|Z_i(·)) = h_10(t) exp{β_o'Z_i(t)}. Throughout the paper, we assume that there exists a sparsity factor s_o = |supp(β_o)| with s_o ≤ n. Note that if we define an improper random variable T^1_i = T_i I(ε_i = 1) + ∞ · I(ε_i > 1), then the subdistribution hazard can be seen as the conditional hazard of T^1_i given Z_i(·). We denote the counting process for the type 1 event as N^1_i(t) = I(T^1_i ≤ t) and its observed counterpart as N^o_i(t) = I(δ_iε_i = 1)I(X_i ≤ t). We also denote the counting process for the censoring time as N^c_i(t) = I(C_i ≤ t). Let Y_i(t) = 1 − N^1_i(t−) (note that this is not the 'at risk' indicator as under the classic Cox model), and let r_i(t) = I(C_i ≥ T_i ∧ t). Note that r_i(t)Y_i(t) = I(t ≤ X_i) + I(t > X_i)I(δ_iε_i > 1) is always observable, even though Y_i(t) or r_i(t) individually may not be. Let G(t) = Pr(C_i ≥ t) and let Ĝ(·) be the Kaplan-Meier estimator of G(·). Here we assume that C is independent of T and Z. Following the notation of Fine and Gray, we call ω_i(t) = r_i(t)Y_i(t)Ĝ(t)/Ĝ(t ∧ X_i) the IPW at-risk process; in other words, the weight for subject i is one if t < X_i, zero after being censored or after failure due to cause 1, and Ĝ(t)/Ĝ(X_i) after failure due to other causes. The log pseudo likelihood (as recently named in Bellach et al. (2018)) that gives rise to the weighted score function in Fine and Gray (1999) for β is given in (1.3), where t* < ∞ is the study end time. In the following, for a vector v, let v^⊗0 = 1, v^⊗1 = v and v^⊗2 = vv'. We define for l = 0, 1, 2 the processes S^(l)(t, β) in (1.4). We then have the score function ṁ(β), i.e. the derivative of the log pseudo likelihood (1.3).
Regarding notation, let us mention that all constants are assumed to be independent of n, p and s_o. We use K and ρ with corresponding enumerated subscripts to denote "big" and "small" constants. We use Q to denote intermediate terms used in the statements or the proofs of various results. We label the subscripts by the corresponding order of their appearance.
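To make the weighting scheme concrete, the following sketch (in Python with NumPy; the function names and the discrete time grid are our own illustration, not the paper's implementation) computes the Kaplan-Meier estimator Ĝ of the censoring distribution and the weights ω_i(t) described above.

```python
import numpy as np

def km_censoring_survival(X, delta):
    """Kaplan-Meier estimator of G(t) = Pr(C >= t), treating
    censoring (delta == 0) as the event of interest."""
    times = np.unique(X)
    surv = np.empty(len(times))
    G = 1.0
    for k, t in enumerate(times):
        at_risk = np.sum(X >= t)
        n_cens = np.sum((X == t) & (delta == 0))
        G *= 1.0 - n_cens / at_risk       # multiplicative jump at censoring times
        surv[k] = G
    return times, surv

def fg_weights(X, delta, eps, t_grid):
    """IPW at-risk process omega_i(t) = r_i(t) Y_i(t) G_hat(t) / G_hat(t ^ X_i):
    one while t <= X_i; zero after censoring or a type-1 failure;
    G_hat(t)/G_hat(X_i) after a type-2 failure."""
    times, surv = km_censoring_survival(X, delta)

    def G_hat(t):
        idx = np.searchsorted(times, t, side="right") - 1
        return 1.0 if idx < 0 else surv[idx]

    W = np.zeros((len(X), len(t_grid)))
    for i in range(len(X)):
        for k, t in enumerate(t_grid):
            if t <= X[i]:
                W[i, k] = 1.0                      # still at risk
            elif delta[i] == 1 and eps[i] == 2:    # observed type-2 event
                W[i, k] = G_hat(t) / G_hat(X[i])
            # otherwise: censored or type-1 failure, weight stays 0
    return W
```

The subjects with observed type-2 events are the ones whose weights decay with Ĝ; all other subjects carry weight zero or one, which is the feature exploited repeatedly in the theory below.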

Organization of the paper
This paper is organized as follows. In Section 2, we provide the bias-corrected estimator (Section 2.1), as well as the confidence interval construction (Section 2.2), for the high-dimensional Fine-Gray model. Construction of a new Hessian estimator, the cornerstone for our bias correction and variance estimation, is presented in Section 2.3. Section 3 presents properties of the developed estimator. Additional notation for the theoretical considerations is presented in Section 3.1. Bounds for the prediction error are presented in Section 3.2; Theorem 1 is the main result on estimation. Section 3.3 studies the sampling distribution of a newly developed test statistic while not requiring model selection consistency or minimal signal strength. Theorem 2 is the main result regarding the asymptotic distribution; there we also present a sequence of intermediate results. We examine our regularized estimator and the one-step bias-corrected estimator through simulation experiments in Section 4, and apply them to the SEER-Medicare data in Section 5.
Estimation and inference for competing risks with more regressors than events

One-step corrected estimator

A natural initial estimator to consider under the high-dimensional setting is an l1-regularized estimator, where the particular loss function of interest is the log pseudo likelihood as defined in (1.3). We note that our results are easily generalizable to any sparsity-inducing convex penalty function, but for simplicity of presentation we give details only for the l1 regularization. That is, we consider the estimator defined in (2.1) for a suitable choice of the tuning parameter λ > 0. Whenever possible, we suppress λ in the notation and use β̂ to denote the l1-regularized estimator. In Section 3.2, we quantify the non-asymptotic oracle risk bound and show that the estimator above, as a typical regularized estimator with p ≫ n, converges at a rate slower than root-n. This implies that for inferential purposes the bias of the estimator cannot be ignored.
Inspired by the work of Zhang and Zhang (2014) and van de Geer et al. (2014), we propose the one-step bias-correction estimator b given in (2.2), where β̂ is defined in (2.1) and Θ̂ is an estimator of the "asymptotic" precision matrix Θ to be defined later. The above construction of the one-step estimator is inspired by the first order Taylor expansion of ṁ(·) in (2.3). The notation "≈" in the above indicates that the equivalence is approximate, with the higher order error terms omitted. We aim to find a good candidate matrix Θ̂ such that −m̈(β_o)Θ̂ ≈ I_p, with I_p denoting the p×p identity matrix. Note that when p ≤ n, the inverse of the Hessian matrix would naturally be a good candidate for Θ̂, but when p ≥ n such an inverse does not necessarily exist. We will elucidate the construction of Θ̂ towards the end of this section.
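As a minimal numerical illustration of the correction step (our own sketch; `score_fn` stands in for the gradient ṁ of the log pseudo likelihood), note that when the log likelihood is exactly quadratic, the one-step estimator recovers the maximizer in a single correction:

```python
import numpy as np

def one_step(beta_hat, score_fn, Theta_hat):
    """One-step bias correction: b = beta_hat + Theta_hat @ score(beta_hat)."""
    return beta_hat + Theta_hat @ score_fn(beta_hat)

# Toy quadratic log likelihood m(b) = -(b - beta_star)' A (b - beta_star) / 2,
# so the score is -A (b - beta_star) and the negative Hessian is A.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
beta_star = np.array([1.0, -0.5])
score = lambda b: -A @ (b - beta_star)
Theta = np.linalg.inv(A)                 # inverse negative Hessian

b = one_step(np.zeros(2), score, Theta)  # exact recovery for a quadratic loss
```

For the (non-quadratic) Fine-Gray pseudo likelihood the correction is only approximate, which is exactly what the omitted higher order terms in the Taylor expansion quantify.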

Confidence Intervals
To construct confidence intervals for components of β_o, we need the asymptotic distribution of b. We will first establish the asymptotic distribution of the score ṁ(β_o). With p > n, we have to restrict the space in which we establish the asymptotic distribution. The asymptotic distribution of ṁ(β_o) is established in the following sense: for any c ∈ R^p such that ‖c‖_1 = 1, √n c'ṁ(β_o) is asymptotically normal, where V is the variance-covariance matrix of √n ṁ(β_o). We construct the estimator V̂ of V in (2.4), where η̂_i and ψ̂_i estimate the terms η_i and ψ_i in the i.i.d. decomposition of the score (see Section 3.3). We may now estimate the variance of √n c'(b − β_o) using the "sandwich" estimator c'Θ̂V̂Θ̂'c. Therefore a (1 − α)100% confidence interval for c'β_o is given in (2.11), with standard normal quantile Z_{1−α/2}. Our proposed approach addresses various practical questions as special cases. First, we can construct a confidence interval for a chosen coordinate β_oj of β_o. To that end, one takes c = e_j, the j-th natural basis vector of R^p, and applies the result (2.11). More generally, we can construct confidence intervals for any linear contrast c'β_o, potentially of any dimension. For example, we can obtain confidence intervals for the linear predictor Z'β_o if the non-time-dependent covariate Z is also sparse, so that we may assume ‖Z‖_1 to be bounded. As the dual problem, we may use the Wald test statistic to test the hypothesis H_0: c'β_o = θ_0.
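The interval and the dual Wald statistic can be sketched as follows (a hedged illustration with our own argument names; `NormalDist` from the Python standard library supplies the quantile Z_{1−α/2}):

```python
import numpy as np
from statistics import NormalDist

def wald_ci(c, b, Theta_hat, V_hat, n, alpha=0.05):
    """(1 - alpha) CI for c' beta_o via the sandwich variance c' Theta V Theta' c."""
    var = c @ Theta_hat @ V_hat @ Theta_hat.T @ c
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1 - alpha/2}
    half = z * np.sqrt(var / n)
    return c @ b - half, c @ b + half

def wald_test(c, b, Theta_hat, V_hat, n, theta0=0.0):
    """Wald statistic for H0: c' beta_o = theta0; compare to N(0, 1)."""
    se = np.sqrt(c @ Theta_hat @ V_hat @ Theta_hat.T @ c / n)
    return (c @ b - theta0) / se
```

Taking c = e_j recovers the coordinate-wise interval for β_oj; any other unit-l1-norm contrast vector c is handled identically.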

Construction of the inverse Hessian matrix
Although the early works under the linear model inspire the construction here, both the specifics and the theoretical analysis remain a challenge. We start by writing the negative Hessian of the log pseudo likelihood (1.3) in (2.13), and we define Σ in (2.14). Under the regularity conditions, to be specified later, Σ is the "asymptotic negative Hessian" in the sense that the element-wise maximal norm ‖Σ + m̈(β_o)‖_max converges to zero in probability. Our goal is to estimate its inverse Θ = Σ^{−1} = (θ_1, . . . , θ_p)', where the θ_j's are the rows of Θ. By definition (2.14), the positive semi-definite matrix Σ is also the second moment of the random vector U_i, with μ(t) defined in (1.4); the expectation of U_i is zero. Hence, to estimate Θ, we may draw inspiration from the early work on inverting the high-dimensional variance-covariance matrix (Zhou et al., 2011). Consider the minimizers γ*_j and τ_j of the expected loss functions in (2.16), where U_j is the jth element of U, and U_{−j} is the p−1 dimensional vector created by dropping the jth element from U. We show that the quantities γ*_j and τ_j defined in (2.16) can be used to construct the inverse of Σ. This is because τ²_j can alternatively be written as in (2.17). By the convexity of the target function E(U_j − U'_{−j}γ_j)², γ*_j must satisfy the first order Karush-Kuhn-Tucker (KKT) conditions (2.18). Applying (2.18) to (2.17), we can then define a vector θ_1 = (1, −γ*_1')'/τ²_1 that satisfies θ_1'Σ = e_1'. Without loss of generality, we may define θ_j accordingly for j = 2, . . . , p, satisfying θ_j'Σ = e_j'. The matrix Θ = (θ_1, . . . , θ_p)' satisfies ΘΣ = (e_1, . . . , e_p)' = I_p; therefore Θ is the inverse of Σ. We now utilize the sample form Σ̂ of Σ in (2.14). In particular, we observe that Σ̂ can be written as the sample second moment (2.20). This form allows us to define the inverse of Σ̂ via regressions among the vectors U_i.
For that purpose, we define the least squares loss function as in (2.21), where U_{i,j} is the jth element of U_i, and U_{i,−j} is the p − 1 dimensional vector obtained by dropping the jth element from U_i. We then define the nodewise LASSO in our context through the minimizers γ̂_j in (2.22). Accordingly, we use γ̂_j and τ̂²_j to construct Θ̂ in (2.23). By the first order KKT condition, we have (Θ̂Σ̂)_{j,j} = 1 and |(Θ̂Σ̂)_{j,k}| ≤ λ_j for j ≠ k.
Choosing λ_max = max_{j=1,...,p} λ_j = o_p(1), we achieve that ‖Θ̂Σ̂ − I_p‖_max goes to zero. The one-step estimator proposed in (2.2) with such a Θ̂ hence converges to the true coefficient β_o approximately at a rate equivalent to that of ṁ(β_o), as illustrated in (2.3). Our proposed nodewise LASSO estimator is innovative in several aspects. Given the difficulty imposed by the model, we cannot carry out high-dimensional inference by simply inverting X'X for a design matrix X as in a linear or generalized linear model. The log pseudo likelihood (1.3) has dependent entries, and the covariates Z_i(t) for i = 1, . . . , n are allowed to be time-dependent. Nevertheless, we identify that the key element for high-dimensional inference in our model is each observation's contribution to the score, the U_i's. Our solution generalizes high-dimensional matrix inversion in a non-trivial way to complex models with censoring, non-standard likelihoods and weighting.
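The nodewise construction can be sketched as follows (our own illustrative implementation with a plain coordinate-descent LASSO; the matrix U stacks the score contributions U_i as rows, and a single tuning value λ is used for all j):

```python
import numpy as np

def lasso_cd(A, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - A g||^2 + lam * ||g||_1."""
    n, p = A.shape
    g = np.zeros(p)
    col_sq = (A ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - A @ g + A[:, j] * g[j]          # partial residual excluding j
            rho = A[:, j] @ r / n
            g[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return g

def nodewise_theta(U, lam):
    """Rows of Theta_hat: regress each score column U_j on the remaining ones."""
    n, p = U.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        gamma = lasso_cd(U[:, idx], U[:, j], lam)
        resid = U[:, j] - U[:, idx] @ gamma
        # KKT-based tau_j^2, which makes (Theta Sigma_hat)_{j,j} = 1 exactly
        tau2 = resid @ resid / n + lam * np.abs(gamma).sum()
        row = np.zeros(p)
        row[j] = 1.0
        row[idx] = -gamma
        Theta[j] = row / tau2
    return Theta
```

By the KKT condition, the diagonal of Θ̂Σ̂ equals one exactly, while the off-diagonal entries are bounded by λ_j/τ̂²_j, mirroring the display above.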

Theoretical considerations
In this section, we present the theory for the estimators β̂ and b and for the confidence intervals described in the previous section. We will quantify the non-asymptotic oracle risk bound for the estimator while allowing p ≫ n under a minimal set of assumptions. A theoretical study of this kind is novel: in the context of competing risks, the martingale structures typically utilized are unavailable, and new techniques need to be developed. In particular, we show that the inverse probability weighting has a finite-sample effect that separates this model from the classical Cox model (see the comments after Theorem 1). We will also establish that a tighter bound holds whenever the hazard rate is bounded (Theorem 3).

Additional notation
In the following, we introduce some additional notation. Counting process martingales are essential tools for establishing high-dimensional theory with dependent partial likelihoods. Unfortunately, the uncensored counting processes {N^1_i(t), i = 1, . . . , n} are not always observable. The observable counterpart N^o_i(t) has no known martingale related to it under the observed filtration F_t = σ{N^o_i(u), I(X_i ≥ u), r_i(u) : u ≤ t, i = 1, . . . , n}. The Doob-Meyer compensator for the submartingale N^o_i(t) under the observed filtration involves the nuisance distribution of T_i given ε_i > 1. To utilize the martingale structure in our theory, we have to define the "censoring complete" filtration. To relate the martingale (3.3) to our log pseudo likelihood m(β), we define its proxy m̃(β) in (3.4). We define processes S̃^(l)(t, β) related to m̃(β) and its derivatives; they can also be seen as proxies to the processes in (1.4). To see this, observe by conditioning that the displayed identity holds, where the weight uses the true censoring distribution G(·). We denote their expectations accordingly. Our proxies precisely target those weighted samples, as S̃^(l)(t, β) differs from S^(l)(t, β) only at those summands with observed type-2 events.
Note that the Kaplan-Meier estimator of G(t) can be written as the product-limit estimator over the observed censoring times, Ĝ(t) = ∏_{s ≤ t} {1 − ∑_i dN^c_i(s) / ∑_i I(X_i ≥ s)}.
To study the convergence of Ĝ(t) to G(t), we denote by M^c_i(t) a martingale related to N^c_i(t), the counting process of observed censoring, under the "censoring" filtration. Let the censoring hazard be defined as h_c(t) = −d log(G(t))/dt. We use integration-by-parts arguments (Murphy, 1994, the Helly-Bray argument on page 727) with random martingale measures, e.g. dM^1_i(t), in our proofs. The total variation of M^1_i(t; w) is defined in (3.10). Since M^1_i(t; w) can be decomposed into a nondecreasing counting process N^1_i(t) minus another nondecreasing compensator ∫_0^t Y_i(u) exp{β_o'Z_i(u)} h_10(u) du, we have the bound (3.11) for its total variation. A similar conclusion also applies to M^c_i(t), i.e. we have a bound for its total variation. As a convention, from here on we suppress the w in the notation to keep it simple.

Oracle inequality
We first establish an oracle inequality for the initial estimation error ‖β̂ − β_o‖_1, based on the following set of conditions, which are weaker than those in the next subsection.
(C1) (Design) With probability equal to one, the covariates satisfy (3.13). The expected at-risk process is bounded away from zero, i.e., (3.14) holds for positive K_4 and ρ_2.
(C2) (Covariance) For K_4 in (3.14), the smallest eigenvalue of the corresponding covariance matrix is bounded away from zero.
(C3) (Smoothness) Each Z_i(t) has the gap between its jumps bounded away from zero. Between two consecutive jumps, Z_i(t) has at most K_6 elements Lipschitz continuous with Lipschitz constant K_7, while the rest of the elements are constant.
Remark 1. Overall, the conditions above are minimal in the sense that they appear in results pertaining to the Cox model (Huang et al., 2013, see e.g., (3.9) on page 1149; (4.5) and Theorem 4.1 on page 1154).
Remark 2. We consider a finite interval [0, t*]. Due to missing censoring times among those with observed type-2 events, we have to make additional assumptions to control the weighting errors. Although the weighted at-risk processes ω_i(t) are asymptotically unbiased, the approximation errors in the tail t → ∞ are poor for any finite n. To avoid unnecessary complications, we choose t* such that we always have sufficient at-risk subjects; note that (3.14) implies P(C > t*) > 0.
Remark 3. We assume a finite maximal norm of Z(t). Condition (3.13) in (C1) is equivalent to an apparently weaker assumption (see, for example, Huang et al. (2013)). This can be seen by noting that the Cox-type partial likelihood for the proportional hazards model is invariant to subtracting any deterministic ζ(t) from Z_i(t).
Remark 4. Condition (3.14) in (C1) carries two facts. First, the at-risk rate for type 1 events is bounded away from zero. Second, relative risks arbitrarily close to zero are truncated at a finite K_4; this is necessary in high dimensions, in order to rule out the irregular cases where the non-zero expectation of the relative risk is dominated by a diminishing proportion of excessively large relative risks. The same argument applies to (C2), in which a lower bound on the restricted eigenvalue of the negative Hessian (Bickel et al., 2009) is imposed.
Remark 5. We assume smoothness of the time-dependent covariates Z(t). Subjects with observed type 2 events remain indefinitely in the risk sets for type 1 events. For time-dependent covariates, continuity is helpful in establishing a slowly growing rate for the maximal relative risk among those subjects; this holds automatically for time-independent covariates. Note that coordinate-wise continuity of Z_i(t) is insufficient as p grows to infinity. We propose (C3) taking into account likely practical scenarios, where the covariates are either constant or change only at finitely many discrete time points.
Under the above assumptions, we are ready to present our estimation error result. Since the result holds in finite samples, we first define a sequence of important constants. For an ε > 0 and constants K_1, . . . , K_7 as well as ρ_1, . . . , ρ_4 (introduced in the conditions above), we define Q_1(ε) and Q^(l)_2(n, p, ε), l = 0, 1, in (3.17), and Q_3(n, p, ε) in (3.18). In high-dimensional models an additional constant, the so-called compatibility factor, plays an important role. For a constant ξ > 1, the compatibility factor is defined in (3.19), where C(ξ, O) denotes the cone set, with O denoting the indices of the non-zero elements of β_o and O^c denoting its complement.
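Since the compatibility factor (3.19) is a minimum over the cone C(ξ, O), a crude Monte-Carlo search over cone directions yields a numerical upper bound on it; the following sketch is our own sanity check, not an algorithm from the paper.

```python
import numpy as np

def compat_factor_mc(Sigma, O, xi, n_draws=2000, seed=0):
    """Monte-Carlo UPPER bound on the compatibility factor
    kappa(xi, O; Sigma) = min over the cone {||v_Oc||_1 <= xi ||v_O||_1}
    of sqrt(s_o * v' Sigma v) / ||v_O||_1."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    s_o = len(O)
    Oc = np.setdiff1d(np.arange(p), O)
    best = np.inf
    for _ in range(n_draws):
        v = np.zeros(p)
        v[O] = rng.standard_normal(s_o)
        l1_O = np.abs(v[O]).sum()
        # draw the O^c block and rescale it so v stays inside the cone
        u = rng.standard_normal(len(Oc))
        v[Oc] = u / np.abs(u).sum() * xi * l1_O * rng.random()
        best = min(best, np.sqrt(s_o * (v @ Sigma @ v)) / l1_O)
    return best
```

For Σ = I_p the factor is at least one (since ‖v_O‖_1 ≤ √s_o ‖v_O‖_2), which the sampled upper bound respects; for an estimated negative Hessian the same routine gives a quick feel for how favorable the design is.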
Theorem 1. For ξ > 1 and an ε > 0, let λ be chosen as below, with Q_3(n, p, ε) defined in (3.18). When n > −log(ε/3)/(2ρ²_2), with ρ_2 given in (C1), we have under regularity conditions (C1) and (C3) that the stated l1-error bound occurs with probability no less than the level below, where Q_4 is a positive constant satisfying the accompanying inequality. Our proof of Theorem 1 also yields results in the l2-norm and the general lq-norm for q ≥ 1; namely, under the same conditions, the corresponding bound occurs with probability no less than the same level, with the weak cone invertibility condition defined analogously. A few comments are in order. For a fixed ε, Q_3(n, p, ε) is of order log(n) log(p)/n. Thus, Theorem 1, together with Lemma 2 (see below), guarantees consistency for λ chosen to be of the order log(n) log(p)/n. The above estimation error rate is comparable to the error rate log(p)/n of the simple Cox model (Huang et al., 2013; Yu et al., 2019), differing only by a factor of log(n). This factor is brought in by the error induced by the IPCW weights. Therefore, under the rate condition s_o log(n) log(p)/n = o(1), we obtain an asymptotically l1-consistent regularized estimator β̂.
The quantity Q_1(ε) describes the error from the IPCW weights through the measurable approximation S̃^(l)(t, β_o) − S^(l)(t, β_o) to the processes S^(l). A naive bound for the measurable approximation is proportional to the magnitude of the relative risks in S^(l), naturally of the order e^{K_3‖β_o‖_1} ≍ e^{s_o}, potentially growing at an exponential rate in n if s_o ≍ n^a for some a > 0. Such a bound grows far too rapidly to deliver any meaningful result. Observing that the summands in S̃^(l) and S^(l) at a particular index i differ from each other only when the i-th subject has an observed type-2 event, we are able to establish a significantly sharper bound. For that purpose, we develop an ε-tail bound for the maximal relative risk among observed type 2 events (see Appendix Lemma B.3). The quantity Q^(l)_2(n, p, ε), involving Q_1(ε) directly in its definition, gives the bound for the error from the measurable approximation to S^(l) (see Appendix Lemma B.5).
For the rest of this section, we provide further details on the proof of Theorem 1, as well as the technical challenges involved. We highlight two results, Lemmas 1 and 2. The first establishes properties of the score vector, while the second establishes properties of the compatibility factor (3.19).
Lemma 1. Let Q_3(n, p, ε) be defined as in (3.18). Under Assumptions (C1) and (C3), the stated bound on ‖ṁ(β_o)‖_∞ holds. Lemma 1 establishes that the event {‖ṁ(β_o)‖_∞ < λ(ξ − 1)/(ξ + 1)} (of interest in Theorem 1) happens with high probability. This task is not straightforward in the presence of both competing risks and censoring. The greatest challenge is the lack of the martingale property of ṁ(β_o). Even if we use its martingale proxy (an approach useful in low dimensions), the gradient of (3.4), there remains an approximation error, which can be significantly amplified when the relative risks grow with the dimension.
To prove Lemma 1, we first show that the relative risks among subjects with observed type 2 events have sub-Gaussian tails. This is achieved through the argument that their CIF cannot be arbitrarily close to one; otherwise, these subjects would have probability close to one of experiencing a type 1 event. As the CIF is monotonically increasing in the relative risk, it is also unlikely to observe excessively large relative risks among the subjects with observed type 2 events. We then use Lemma A.3(i) in the Appendix to establish the concentration around zero across all observed type 1 event times. Theorem 1 requires the event {κ(ξ, O; −m̈(β_o)) > Q_4} for a sequence of Q_4 bounded away from zero as the sample size n goes to infinity. In Lemma 2, we show that this event happens with high probability. Using the connection between the compatibility factor and the restricted eigenvalue (van de Geer and Bühlmann, 2009), we show that κ(ξ, O; −m̈(β_o)), the compatibility factor over the cone C(ξ, O), is bounded away from zero with probability tending to one.

Asymptotic normality for the one-step estimator and honest coverage of confidence intervals
Obtaining the asymptotic normality is technically challenging: the log pseudo likelihood has dependent summands, both through the initial lasso estimator and through the Kaplan-Meier estimator. We establish the asymptotic normality of the one-step estimator b and the coverage of the confidence intervals without requiring model-selection consistency of the initial estimator. To remove the small-sample bias of the IPCW, we need slightly stronger conditions than in the previous section. In this section alone, we use K and ρ without subscripts to denote constants independent of n, p and s_o; we have only one constant, K_n, that is allowed to grow with the dimension and is therefore denoted differently.
(D1) (Design) The true linear predictors are uniformly bounded with probability one: there exist random processes d^z_i(t), Δ^z_i(t) and a counting process N^z_i(t) such that β_o'd^z_i(t) is uniformly bounded between ±K and uniformly Lipschitz-K. Moreover, the number of jumps of N^z_i(t) satisfies K_n = o(n/(log(p) log(n))), and its intensity function satisfies h^N(t) ≤ K.
(D4) (Dimension) The rows θ_j of the matrix Σ^{−1} satisfy ‖θ_j/Θ_{j,j}‖_1 ≤ K and are sparse, with sparsities s_j. We next present Theorem 2, which justifies all the proposed inference procedures in Section 2.2. For that purpose, we denote by V the asymptotic variance of ṁ(β_o), defined via (3.1) and (3.9). Theorem 2. Let Θ̂ be defined as in Section 2.3. Let V, b, Θ̂ and V̂ be defined as in (3.22), (2.2), (2.23) and (2.4), respectively. Let c ∈ R^p with ‖c‖_1 = 1 and c'ΘVΘ'c → ν² ∈ (0, ∞). Then, whenever (C1) and (D1)-(D4) hold, √n c'(b − β_o) converges in distribution to N(0, ν²). As a result of the stronger conditions required for Theorem 2, which we explain in more detail below, we are able to achieve an improved estimation error for the initial estimator, as stated in the next theorem.
For the rest of this section, we explain the assumptions and theoretical results needed for Theorem 2, summarized in Lemmas 3-7. Condition (D1) is needed whenever the model departs significantly from the linear case (van de Geer et al., 2014; Fang et al., 2017). In our case, the asymptotic normality of √n ṁ(β_o) depends fundamentally on the asymptotic tightness of √n times the gradient of the proxy m̃(β_o). As a necessary condition, the predictable quadratic variation, under the filtration F*_t, of the martingale must have a finite bound independent of the dimension of the covariates. This requires that the magnitude of the summands in (3.27) either be bounded or have light tails. Hence, we cannot allow the relative risk exp{β_o'Z_i(t)} to grow arbitrarily large. We next observe that (D2) is a standard assumption for the validity of the nodewise penalized regressions (2.22). Finally, note that Theorem 2 utilizes Condition (D3), a condition stronger than (C3), needed for the √n-approximation error between ṁ(β_o) and the gradient of m̃(β_o). If we define the population versions of the nodewise components in (2.20)-(2.22), then the true parameters {γ*_j, τ²_j : j = 1, . . . , p} uniquely define the inverse negative Hessian Θ as described in Section 2.3. We prove this statement in the following lemma.
Next, we discuss the properties of the estimators γ̂_j, τ̂_j and Θ̂, the defining components of the variance estimate.
Unlike the setting of van de Geer and Bühlmann (2009), which has i.i.d. entries, the nodewise LASSO in (2.22) has dependent U_i's through the common Z̄(t, β); see (2.20). Thus, our error rate takes the multiplicative form s_o s_max, instead of the sum s_o + s_max that one might expect under generalized linear models. We consider this rate to be optimal under our model.
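As an illustration of the nodewise construction, the following sketch estimates the rows of Θ = Σ⁻¹ by l1-penalized regressions of each column of a design matrix on the remaining ones, in the spirit of the construction in (2.20)-(2.22) and van de Geer et al. (2014). Here `X` stands in for a generic design matrix (in our setting the entries are the dependent U_i's), scikit-learn's `Lasso` is used for convenience, and the penalty level `lam` is an illustrative choice rather than the paper's tuning rule.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_theta(X, lam=0.1):
    """Estimate Theta = Sigma^{-1} row by row via nodewise lasso
    regressions (a generic sketch of the van de Geer et al. (2014)
    construction); X is an n x p design matrix."""
    n, p = X.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        # regress column j on all other columns with an l1 penalty
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, idx], X[:, j])
        gamma = fit.coef_
        # tau_j^2 = X_j' (X_j - X_{-j} gamma) / n
        resid = X[:, j] - X[:, idx] @ gamma
        tau2 = X[:, j] @ resid / n
        # j-th row of Theta: (1, -gamma) / tau_j^2, placed in position
        Theta[j, j] = 1.0 / tau2
        Theta[j, idx] = -gamma / tau2
    return Theta
```

With independent columns and a small penalty, the estimate is close to the inverse of the population covariance (here the identity).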
Using Lemma 4, we can establish the approximation condition for b proposed in (2.3).
The proof uses the same approach as the original low-dimensional result in Fine and Gray (1999). We approximate ṁ(β^o) by the sample average of the i.i.d. terms η_i + ψ_i plus an o_p(n^{-1/2}) term. We note that the same approach requires nontrivial techniques in order to remain valid in high dimensions. In particular, we discover and exploit the martingale property of the relevant term. The last piece of our proof of Theorem 2 is the element-wise convergence of the "meat" matrix (2.4) in the "sandwich" variance estimator.
Putting Lemmas 6 and 7 together, we obtain the main result stated in Theorem 2. The details of the proofs are presented in the Appendix. Throughout the proofs, we rely heavily on our concentration results for time-dependent processes, which we state in Section A.2 and prove in Section B.4.

Numerical Experiments
To assess the finite sample properties of our proposed methods, we conduct extensive simulation experiments with various dimensions and dependence structures among the covariates.

Setup 1
Our first simulation setup closely follows that of Fine and Gray (1999) but considers high-dimensional covariates. In particular, each Z_i is a vector consisting of i.i.d. standard normal random variables. For cause 1, only β_{1,1} = β_{1,2} = 0.5 are non-zero, with the cumulative incidence function given by the Fine-Gray subdistribution model. For cause 2, β_{2,1} = β_{2,3} = · · · = β_{2,p−1} = −0.5 and β_{2,2} = β_{2,4} = · · · = β_{2,p} = 0.5. We consider four different combinations: n = 200, p = 300; n = 200, p = 500; n = 200, p = 1000; and n = 500, p = 1000. Note that this setup imposes sparsity on the cause 1 effects but not on the cause 2 effects. As the Fine-Gray model does not require modeling cause 2 in order to make inference on cause 1, we expect that the non-sparsity of the cause 2 effects should not affect the inference on cause 1. The results are presented in Table 1. We focus on inference for the two non-zero coefficients β_{1,1} and β_{1,2}, as well as one arbitrarily chosen zero coefficient β_{1,10}. The mean estimates are the averages of the one-step estimator b over the 100 repetitions, reported together with the other quantities described below. We can see from the average estimates column that the one-step b is bias-corrected and that the presence of many non-zero coefficients for cause 2 does not affect our inference on cause 1.
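The data-generating mechanism of this setup can be sketched as follows, assuming the usual Fine and Gray (1999) subdistribution model for cause 1 and, conditionally, an exponential model for cause 2. The mass parameter `mass` and the censoring rate `cens_rate` are illustrative assumptions of ours, as the exact constants are not restated in the text.

```python
import numpy as np

def simulate_fg(n, p_cov, beta1, beta2, mass=0.5, cens_rate=0.25, rng=None):
    """Generate competing-risks data from a Fine-Gray subdistribution
    model for cause 1; `mass` and `cens_rate` are illustrative choices,
    not taken from the paper."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, p_cov))
    eta1, eta2 = Z @ beta1, Z @ beta2
    # P(cause 1 | Z) = 1 - (1 - mass)^{exp(eta1)}
    p1 = 1.0 - (1.0 - mass) ** np.exp(eta1)
    cause = np.where(rng.uniform(size=n) < p1, 1, 2)
    u = rng.uniform(size=n)
    t = np.empty(n)
    # cause-1 times: invert F1(t|Z) = 1 - [1 - mass(1 - e^{-t})]^{exp(eta1)}
    i1 = cause == 1
    v = 1.0 - (1.0 - u[i1] * p1[i1]) ** np.exp(-eta1[i1])
    t[i1] = -np.log(1.0 - v / mass)
    # cause-2 times: exponential with rate exp(eta2), given cause 2
    i2 = ~i1
    t[i2] = rng.exponential(scale=np.exp(-eta2[i2]))
    # independent exponential censoring
    c = rng.exponential(scale=1.0 / cens_rate, size=n)
    x = np.minimum(t, c)
    delta = np.where(t <= c, cause, 0)  # 0 = censored
    return Z, x, delta
```

For the first configuration of the setup one would call, e.g., `simulate_fg(200, 300, beta1, beta2)` with `beta1` having 0.5 in its first two entries and zeros elsewhere.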
In practice the choice of the tuning parameters is particularly challenging, as the optimal value is determined only up to a constant. Moreover, the theoretical results are asymptotic. These factors, together with the finite sample effects of n ≪ p, lead to suboptimal performance of many proposed one-step correction estimators (van de Geer et al., 2014; Fang et al., 2017). The suboptimality is amplified for survival models, due to the nonlinearity of the loss function and the presence of censoring; both require a larger sample size for the asymptotic statements to take effect in finite samples. In the following, we propose a finite-sample correction to the construction of confidence intervals, and in particular to the estimated standard error (SE).
Let se(b_j; β) denote the asymptotic standard error as given in Section 2.2. As a finite-sample correction we propose to use se(b_j; b) in place of se(b_j; β); that is, the variance estimation based on the initial LASSO estimate β is replaced by one based on the one-step estimate b. This can be viewed as another iteration of the bias-correction formula, and the resulting SE is therefore a "two-step" SE estimator. We report the coverage rates of the confidence intervals constructed with this finite-sample correction in Table 1, and we observe good coverage close to the nominal level of 95%. We note that with 100 simulation runs the standard error of the simulated coverage probability is about 2.18% if the true coverage is 95%; that is, the observed coverage can be expected to fall within 95 ± 4.36% (two standard errors). The coverage is good for all three coefficients, whether non-zero or zero. In contrast, results in the existing literature suffer from under-coverage for the non-zero coefficients.
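Schematically, the two-step correction amounts to re-evaluating the plug-in sandwich variance at the one-step estimate instead of the initial LASSO fit. The helper below is a hypothetical sketch (the names `score_fn` and `theta_hat` are ours), shown on generic per-subject score contributions rather than the paper's weighted partial-likelihood score.

```python
import numpy as np

def two_step_se(score_fn, theta_hat, beta_init, b_onestep):
    """Return (naive SE, two-step SE): the sandwich standard errors
    sqrt(diag(Theta V(beta) Theta') / n) evaluated at the initial
    estimate and at the one-step estimate. `score_fn(beta)` must return
    the n x p matrix of per-subject score contributions; this is a
    schematic helper, not the paper's code."""
    def se(beta):
        U = score_fn(beta)              # n x p score contributions
        n = U.shape[0]
        V = U.T @ U / n                 # "meat" matrix evaluated at beta
        cov = theta_hat @ V @ theta_hat.T / n
        return np.sqrt(np.diag(cov))
    return se(beta_init), se(b_onestep)
```

The design choice mirrors the text: the only difference between the two SEs is the point at which the variance is evaluated.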
The last column 'level/power' in Table 1 refers to the empirical rejection rate of the null hypothesis that the coefficient is zero, by the two-sided Wald test Z = (b_j − β⁰_{1,j})/se(b_j; β) with null value β⁰_{1,j} = 0, at the nominal 0.05 significance level. We see that although the uncorrected se(b_j; β) is used, the nominal level is well preserved for the zero coefficient β_{1,10}, and the power is high for the non-zero coefficients β_{1,1} and β_{1,2} at the given sample sizes and signal strength.
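The Wald test above can be sketched as follows; the function name and defaults are ours, and the two-sided p-value is computed via the complementary error function, which equals 2P(N(0,1) > |Z|).

```python
import math

def wald_test(b_j, se_j, null_value=0.0, level=0.05):
    """Two-sided Wald test Z = (b_j - null_value) / se_j.
    Returns the z-statistic, the p-value 2*P(N(0,1) > |Z|), and the
    rejection indicator at the given level."""
    z = (b_j - null_value) / se_j
    pval = math.erfc(abs(z) / math.sqrt(2.0))  # = 2 * (1 - Phi(|z|))
    return z, pval, pval < level
```

For example, an estimate of 0.5 with SE 0.1 gives Z = 5 and is rejected, while 0.05 with the same SE is not.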
We repeat the above simulations with different values for β 1,1 to investigate the power of the Wald test. The results are illustrated in Figure 1, where we see that the power increases with n and decreases with p as expected.

Setup 2
In the second setup we consider the case where the covariates are not all independent, which is more likely in practice for high-dimensional data. We adopt the block dependence structure also used in Binder et al. (2009). We consider n = 500, p = 1000; β_{1,1∼8} = 0.5, β_{1,9∼12} = −0.5 and the rest are all zero; β_{2,1∼4} = β_{2,13∼16} = 0.5, β_{2,5∼8} = −0.5. The results are reported in Table 2, which shows the inferential results for the non-zero coefficients β_{1,1} ∼ β_{1,12}, for the zero coefficients β_{1,13} ∼ β_{1,16} from the third correlated block that also contains some of the non-zero coefficients, and for the arbitrarily chosen zero coefficient β_{1,30}. The initial LASSO estimator tended to select only one out of every four non-zero coefficients among the correlated covariates (data not shown); it is known that a block dependence structure is particularly challenging for Lasso-type estimators. The one-step estimator, on the other hand, performed remarkably well, capturing all of the non-zero coefficients.
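Block-correlated covariates of this kind can be generated with a shared-factor construction. The block size of four and the within-block correlation of 0.5 below are illustrative assumptions of ours, since the exact parameters of the Binder et al. (2009) structure are not restated in the text.

```python
import numpy as np

def block_normal(n, p, block=4, rho=0.5, rng=None):
    """Standard-normal covariates with block equicorrelation: pairwise
    correlation `rho` within each block of `block` variables and zero
    across blocks. Block size and rho are illustrative choices."""
    rng = np.random.default_rng(rng)
    n_blocks = (p + block - 1) // block
    # Z = sqrt(rho) * shared block factor + sqrt(1 - rho) * noise
    # gives unit variance and correlation rho within each block
    F = np.repeat(rng.standard_normal((n, n_blocks)), block, axis=1)[:, :p]
    E = rng.standard_normal((n, p))
    return np.sqrt(rho) * F + np.sqrt(1.0 - rho) * E
```

The shared-factor form makes the intended covariance explicit: variables in the same block load on one common factor, variables in different blocks share nothing.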
Compared to the results in the last part of Table 1 with the same n and p, the block-correlated covariates led to slightly more bias in b, although the CI coverage remained high. The power also remained high, although in the third block, with its mix of signal and noise variables, the type I error rates appeared slightly elevated.

SEER-Medicare data example
The SEER-Medicare linked database contains clinical information and claims codes for 57011 patients diagnosed between 2004 and 2009. The clinical and demographic information were collected at diagnosis, and the insurance claim data were from the year prior to diagnosis. The clinical information contained PSA, Gleason Score, AJCC stage and year of diagnosis. Demographic information included age, race, and marital status. The same data set was considered in Hou et al. (2018), where the emphasis was on variable selection and prediction error. Our focus is on testing and construction of confidence intervals.
In the following, we consider 2000 patients diagnosed during the year 2004. The only cause of loss to follow-up was administrative censoring at the end of the study, in 2011. Consequently, the year of enrollment was the only factor affecting the censoring distribution. Since all subjects in our sample share the same year of enrollment, 2004, we may reasonably make the independent censoring assumption. Among them, 76 died from the cancer and 337 died from causes unrelated to cancer. The process of identifying the causes is detailed in Riviere et al. (2019). There were 9326 binary claims codes in the data. Our aim here is to identify the risk factors for non-cancer mortality using the Fine-Gray model. We kept only the claims codes with at least 10 and at most 1990 occurrences, which left 1197 covariates. We centered and standardized all the covariates before performing the analysis. To determine the penalty parameters λ and λ_j, we used 10-fold cross-validation.
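The screening and standardization steps can be sketched as follows; the thresholds mirror the ones in the text (with n = 2000, codes are kept if they occur at least 10 and at most 1990 times), while the function name is ours.

```python
import numpy as np

def filter_and_standardize(codes, min_count=10, max_count=None):
    """Screen binary claims codes by frequency, then center and scale.
    `codes` is an n x q 0/1 matrix; by default a code is kept if it
    occurs in at least `min_count` and at most n - `min_count`
    patients (10 and 1990 for the n = 2000 sample in the text)."""
    n = codes.shape[0]
    if max_count is None:
        max_count = n - min_count
    counts = codes.sum(axis=0)
    keep = (counts >= min_count) & (counts <= max_count)
    X = codes[:, keep].astype(float)
    X -= X.mean(axis=0)      # center each retained code
    X /= X.std(axis=0)       # scale to unit standard deviation
    return X, keep
```

Dropping near-constant codes before standardizing also avoids division by a (near-)zero standard deviation.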
In Table 3, we present the results for 21 coefficients. We focused on potential risk factors for non-cancer mortality, such as heart disease and colon cancer (distinct from the prostate cancer); the coefficients to be tested were chosen ahead of time following recommendations from physicians. For comparison we also include the clinical markers associated with the prostate cancer. A description of the variables is given in Table 4. For each coefficient, we report the initial estimate β, the one-step estimate b, the corrected SE, the 95% CI constructed with the corrected SE, and the two-sided Wald test p-value calculated using the uncorrected SE. In Table 3, we see that the claims codes ICD-9 4280, CPT 93015 and ICD-9 42731, all related to heart disease, are all significant at the 5% level with Bonferroni correction for the 21 variables included in the table. Moreover, a heart attack indicator variable, ICD-9 41189, is significant at the 10% level even though the naive regularized estimator was not able to select this variable as important; this indicates that our inference procedure is more sensitive (and stable) at discovering significant variables. In support of that, an indicator of a possible cancer in the abdomen, CPT 74170, is reported as significant at the 5% level although the initial Lasso regularized method failed to include this variable. A similar result is seen for the indicator of a fall (CPT 72050), which for an elderly person can be fatal. An indicator of a colon cancer (CPT 45380) turns out to be significant at the 10% level although the Lasso method initially set it to zero. Therefore, our one-step method is able to recover important risk factors that would have been missed by the initial regularized estimator. In contrast, non-life-threatening diseases were not selected as significant predictors of non-cancer mortality; these include Parkinson's (ICD-9 3320), psychosis (ICD-9 2989), bronchitis (ICD-9 49121) and dementia (ICD-9 2948) in the table.
It is worth noting that some of these were selected by the initial estimate but were then corrected by our test. We also note that the prostate cancer related variables PSA, Gleason score and AJCC all have large p-values for non-cancer mortality. This is consistent with the results in Hou et al. (2018), where under the competing risks model the predictors for a second cause have only secondary importance in predicting the events due to the first cause.

Discussion
This article focuses on estimation and inference under the Fine-Gray model with many more covariates than the number of events, the latter being well known as the effective sample size for survival data. The article studies the rate of convergence of a Lasso estimator and develops a new one-step estimator that can be utilized for asymptotically optimal inference: confidence intervals and testing. These results can be generalized to any sparsity-inducing convex penalty function, including the one-step SCAD, the adaptive LASSO and the elastic net. Moreover, it is worth noting that the variance estimation is novel in that it regresses a re-weighted score vector onto the score vector; in this way, the usual difficulty with the asymptotic Hessian is avoided. It is also worth pointing out that the sandwich estimator and the bootstrap carry biases in high dimensions. An often overlooked restriction on the time-dependent covariates Z_i(t), i = 1, . . . , n, under the Fine-Gray model is that Z_i(t) must be observable even after the i-th subject experiences a type 2 event. In practice, Z_i(t) should be either time independent or external (Kalbfleisch and Prentice, 2002). In our case the continuity conditions (C3) and (D3) are easily satisfied if the majority of the elements of Z_i(t) are time independent, which is most likely the case in practice. Our theory does not apply to studies involving longitudinal variables that are truly measured continuously over time.
We have illustrated that methods based on regularization alone (without bias correction) may have severe disadvantages in complex data situations; for example, they may fail to identify relevant variables that are associated with the response. From the analysis of the SEER-Medicare data, we see that variables like CPT 72050 (related to a fall) or CPT 74170 (related to diagnostic imaging of the abdomen, often for suspected malignancies) would not have been discovered as important risk factors for non-cancer mortality by regularization alone; in reality, both can be life-threatening events for an elderly patient. The one-step estimator, on the other hand, was able to detect these, thereby providing a valuable tool for practical applications. The one-step estimator is applicable as long as the model is sparse, with no minimum signal strength required; this is another important aspect that makes it more desirable for practical use than LASSO-type estimators.

Acknowledgement
We would like to acknowledge our collaboration with Dr. James Murphy of the UC San Diego Department of Radiation Medicine and Applied Sciences on the linked Medicare-SEER data analysis project that motivated this work. We would also like to thank his group for help in preparing the data set.

Appendix
In the appendix, we denote global quantities by Q and event sets by Ω, with subscripts labelled by their order of appearance. Other quantities are all local, i.e. defined only for the current lemma. We denote the ordered observed type-1 event times by T¹_{(1)}, . . . , T¹_{(K_T)}.

A Concentration Inequalities
Here we give the statements of the inequalities frequently used in our proofs. The notations in this section are all generic.

A.2. Concentration Inequalities for Time-dependent Processes
is a counting process bounded by K N . Denote its jumps as for some d s (t) and J s (t) satisfying d s (t) max < L S and J s (u) max < K S , and

B Proofs of Main Results
We shall present our proofs in the following order. First, we give the proofs of our theorems using the main lemmas stated in Section 3. Second, we present the auxiliary lemmas necessary for the proofs of the main lemmas. Third, we present the proofs of the main lemmas. Lastly, we present the proofs of our concentration inequalities and auxiliary lemmas.

B.1. Proofs of Theorems
Proof of Theorem 1.
Observe that the same techniques as those of Huang et al. (2013) apply (see, for example, Lemmas 3.1 and 3.2 therein). The structure of the partial likelihood is the same as that of the Cox model, modulo the IPW weight functions w_j(t). Following the same line of proof, we obtain that on the event {‖ṁ(β^o)‖_∞ < λ(ξ − 1)/(ξ + 1)}, the estimation error of the LASSO estimator β defined in (2.1) admits the stated bound, where ς is the smaller solution of the corresponding equation. The proof is then completed by applying the conclusion of Lemma 1.

Proof of Theorem 2.
By Lemmas 5 and 6, we have the required decomposition. In Lemma 7, we have shown that ‖V‖_max is bounded by K²(1 + Ke^K t*)²{1 + 2(1 + K)e^K/ρ²}² with probability tending to one. In Lemma 3, we have shown that ‖Θ‖₁ is bounded by K/ρ. Then, we can apply Lemmas 4 and 7 to obtain the stated rate. Note that we use the following fact:

Proof of Theorem 3.
Since we now assume (D1), the relative risks are bounded almost surely from above and below by constants: 0 < e^{−K} ≤ e^{β^o'Z_i(t)} ≤ e^K < ∞. We may set K₄ = e^K to obtain (C2) directly from (D2). We can also improve the rate of the estimation error in Theorem 1 by a log(n) factor, because we need not let Q₁(ε) in Lemma B.5 grow with n.
Lemma B.9. Let Γ j , β and γ * j be defined as in (

B.3. Proof of Main Lemmas
Proof of Lemma 1.
Let T¹_{(1)}, . . . , T¹_{(K_T)} be the observed type-1 event times. We may decompose the score ṁ(β^o) as its martingale proxy plus an approximation error, with Z̄ and its limit defined in (1.4) and (3.5), respectively. Recalling the form of the counting process for observed type-1 events, we may apply Lemma 3.3 in Huang et al. (2013) under (3.13). Notice that the resulting inequality is sharper than that in Lemma A.4(i) because the compensator part of the martingale proxy is zero. The concentration result for the approximation error is established in Lemma B.5 on the event Ω₄(ε). We obtain the concentration inequality for ṁ(β^o) by adding the bounds and tail probabilities together.

Proof of Lemma 2.
Our strategy here is the same as that for Lemma 1. We first show that the compatibility constant κ(ξ, O; −m̈(β^o)) is lower bounded by that of the proxy Hessian plus a diminishing error. Since the proxy Hessian takes the form of a Cox model Hessian, we may then apply the results from Huang et al. (2013).
We can bound this term directly. By (D2), the minimal eigenvalue of Σ is at least ρ. We obtain through a spectral decomposition that the maximal eigenvalue of Θ = Σ⁻¹ is at most ρ⁻¹. Hence, we have
We further decompose I₂ into three terms. By (D1) and (D3), each M¹_i(t) has one jump at an observed event time and is e^K K-Lipschitz elsewhere. Since {C_i, T¹_i : i = 1, . . . , n} is a set of independent continuous random variables, there are no ties among them with probability one. Hence, we may modify the integrands in I₂ and I₂' at the observed censoring times without changing the integrals. After this replacement, we can apply Lemma B.6(ii) to conclude that ‖I₂‖_∞ and ‖I₂'‖_∞ are both o_p(1).
The total variation of M¹_i(t) is at most max{1, e^K Kt*}, so Lemma B.6(i) applies. Besides the representation in Lemma B.4, ω̂_i(t) − ω_i(t) has another martingale representation. Denote the Nelson-Aalen estimator of the censoring cumulative hazard by Ĥ_c(t); we then have an F_t martingale M_c(t), and by Lemma A.4(i), sup_{t∈[0,t*]} |M_c(t)| = O_p(n^{−1/2}). For t > X_i and δ_i ε_i > 1, we obtain a representation with an error term: the discrepancy between the Kaplan-Meier and the Nelson-Aalen estimators plus a second-order Taylor expansion remainder. We shall show that it is O_p(1/n). Since |M_c(t)| = O_p(n^{−1/2}), the second-order remainder is of that order. Let c_k be an observed censoring time. The increment of −log(Ĝ(t)) − Ĥ_c(t) at c_k is a second-order remainder; the at-risk proportion is bounded away from zero, and −log(G(t)) ≤ −log(G(t*)) is bounded from above. We have shown that both Ĝ(t) and Ĥ_c(t) are uniformly √n-consistent. We obtain that Ĝ(X_i) is bounded away from zero and Ĥ_c(t) is bounded with probability tending to one. Putting these together, we obtain the claimed supremum bound. Define π̂(t) = n⁻¹ Σ_{i=1}^n I(X_i ≥ t), and let q(t) = E{q̂(t)} and π(t) = E{π̂(t)}. We write I₄ as an i.i.d. sum plus an error through integration by parts; the integrand is uniformly bounded by K(Kt* + 1), has at most one jump, and is KK-Lipschitz elsewhere. Hence, we can apply Lemma A.3(ii) to get sup_{t∈[0,t*]} ‖q̂(t) − q(t)‖_∞ = O_p(√(log(p)/n)) and sup_{t∈[0,t*]} |π̂(t) − π(t)| = O_p(√(log(n)/n)). Finally, we write c'Θṁ(β^o) as an i.i.d. sum. We have E{c'Θη_i} = 0 because of its martingale structure. We show E{c'Θψ_i} = 0 again by introducing its martingale proxy: the first term is zero because of the martingale structure, and the second term is zero because of the unbiasedness of the IPW weights. Each c'Θ(ψ_i − η_i) is mean zero and bounded by (K/ρ)K(1 + Kt*) + (K/ρ)K(1 + Kt*)(1 + Kt*)·2e^K/ρ² with probability one. The variance c'ΘVΘc has a bounded, non-degenerate limit ν². Hence, {c'Θ(ψ_i − η_i) : i = 1, . . . , n} satisfies the Lindeberg condition, and the Lindeberg-Feller CLT yields the asymptotic normality. This concludes the proof of the lemma.
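As a numerical illustration of the O_p(1/n) discrepancy used in the proof above, the following sketch computes the Kaplan-Meier estimate of the censoring survival function and the Nelson-Aalen cumulative hazard, and evaluates sup_t |−log Ĝ(t) − Ĥ_c(t)| over a truncated time range mimicking [0, t*]. This is our own check, not code from the paper.

```python
import numpy as np

def km_vs_na(x, is_censored):
    """Kaplan-Meier survival estimate G(t) for the censoring
    distribution versus the Nelson-Aalen cumulative hazard H_c(t):
    returns the largest gap sup_t |(-log G(t)) - H_c(t)| over a
    truncated range (the gap shrinks at rate 1/n)."""
    order = np.argsort(x)
    d = np.asarray(is_censored, dtype=float)[order]  # 1 = censoring "event"
    n = len(d)
    at_risk = n - np.arange(n)             # risk-set size at each ordered time
    km = np.cumprod(1.0 - d / at_risk)     # Kaplan-Meier survival
    na = np.cumsum(d / at_risk)            # Nelson-Aalen cumulative hazard
    k = int(0.9 * n)                       # truncate, mimicking [0, t*]
    return np.max(np.abs(-np.log(km[:k]) - na[:k]))
```

Since −log(1 − d/r) − d/r ≈ d/(2r²) for each risk-set size r, the accumulated gap on the truncated range is of order 1/n, matching the rate claimed in the proof.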
Proof of Lemma 7.

By Lemma
We repeat the trick for η̂_i − η_i, applying Lemma A.4(ii) to the F*_{i,t} martingale. Putting the rates together, we obtain the stated uniform bound. We can directly obtain sup_{t∈[0,t*]} |π̂(t) − π(t)| = O_p(√(log(n)/n)) from Lemma A.3(ii). The total variation of M^c_i(t) is at most 1 + 2e^K/ρ² with probability tending to one, by Lemma B.7. Using the results so far, we obtain a bound uniform over i = 1, . . . , n. The remainder is an F_t martingale; we can stack the n martingales in R^p into an R^{np} vector and apply Lemma A.4. We have shown the uniform bounds over i = 1, . . . , n via Lemmas B.1 and B.7. In addition, we observe that n⁻¹ Σ_{i=1}^n (η_i + ψ_i)(η_i + ψ_i)' is an average of i.i.d. terms whose expectation is defined as V. By Lemmas B.1 and B.7, we have the uniform maximal bound over i = 1, . . . , n. We finish the proof by applying Lemma A.1 to the last term in the decomposition above.

B.4. Proofs of Auxiliary Lemmas
Proof of Lemma A.3.
(i) Without loss of generality, let t₁₁ be the first jump time of N₁(t). By the i.i.d. assumption, t₁₁ is independent of all S_i(t) with i ≥ 2. Thus, the sequence is a martingale with respect to the filtration σ(S_i(t), i ≤ l), l = 2, . . . , n, and its increments are bounded. Applying Lemma A.2 to L_n, we get Pr(‖L_n‖_max > K_S x) < 2q e^{−nx²/2}. Since the dropped first term is also bounded by K_S/n, we get Pr(‖S̄(t₁₁) − s(t₁₁)‖_max > K_S x + K_S/n) < 2q e^{−nx²/2}.
We use a simple union bound to extend the result to all t_{ij}'s, whose number is at most nK_N.
Combining with the result from Lemma A.3(i), we obtain the bound over a grid containing T_n and the jumps of the N_i(t). We only need to show that the variation of S̄(t) − s(t) is sufficiently small inside each bin created by the grid. Let t' and t'' be consecutive elements of T_n by order. By our construction, there is no jump of any of the counting processes N_i(t) in the interval (t', t''); otherwise, the jump time would be another element of T_n between t' and t'', so that t' and t'' would not be consecutive elements by order. Under the assumption of the lemma, the elements of all the S_i(t)'s are L_S-Lipschitz in (t', t''). Moreover, |t'' − t'| ≤ t*/n because of the deterministic grid {kt*/n : k = 1, . . . , n} ⊂ T_n. Along with the càglàd property, we obtain a bound on the variation. For any t ∈ (t', t''), we bound the variation of s(t) similarly. For arbitrary t ∈ [0, t*], we find the corresponding bin (t', t''] containing t. Putting the results together, we obtain the claim.
Proof of Lemma A.4.
(i) The summands in M_Φ(t) are integrals of F_{t−}-measurable processes against F_t-adapted martingales, so M_Φ(t) is an F_t-adapted martingale (Kalbfleisch and Prentice, 2002, p. 165). Suppose {T_i : i = 1, . . . , n} are the jump times of {N_i(t)}; we artificially set T_i = t* if N_i(t) has no jump in [0, t*]. Let 0 ≤ R₁ ≤ · · · ≤ R_{2n} be the order statistics of {T_i : i = 1, . . . , n} ∪ {kt*/n : k = 1, . . . , n}. Then {R_k : k = 1, . . . , 2n} is a set of ordered F_t stopping times. Applying the optional stopping theorem, we get a discrete-time martingale M_Φ(R_k) adapted to F_{R_k}. The increment of M_Φ(R_k) comes from either the counting part or the compensator part, which we can bound separately. By our construction of the R_k's, each left-open right-closed bin (R_k, R_{k+1}] satisfies two conditions: there is at most one jump of Σ_{i=1}^n N_i(t) in the bin, at R_{k+1}, and the length of the bin is at most t*/n. The increment of the martingale M_Φ(t) over (R_k, R_{k+1}] decomposes into two coordinate-wise integrals, a jump minus a compensator. With the assumed a.s. upper bound sup_{t∈[0,t*]} ‖Φ_i(t)‖_max ≤ K_Φ, the jump of M_Φ(t) in the bin is almost surely bounded. Additionally, with the assumed upper bound sup_{i=1,...,n} sup_{t∈[0,t*]} h_i(t) ≤ K_h, the compensator of M_Φ(t) increases over the bin by a controlled amount. We obtain a uniform concentration inequality for M_Φ(R_k) by Lemma A.2. Remark that the uniform version of Lemma A.2 is an application of Doob's maximal inequality (Durrett, 2010, Theorem 5.4.2, page 213). For t ∈ (R_k, R_{k+1}), we use the bounded increments derived above. The event ‖Φ_i(t)‖_max ≤ K_{Φ,ε} a_n occurs with probability at least 1 − ε/2 for any n; we apply Lemma A.4(i) to obtain that the corresponding event occurs with probability no less than 1 − ε.
Proof of Lemma B.1.
Notice that all the a_i(t)'s are nonnegative; hence Σ_{i=1}^n |a_i(t)| = Σ_{i=1}^n a_i(t). We apply Hölder's inequality to each coordinate. Hence, the maximal norm of Σ_{i=1}^n a_i(t) Z_i(t)^{⊗l} is bounded by (K₃/2)^l under (3.13). A similar result holds with the sum replaced by the expectation.

Proof of Lemma B.2.
Since {I(X_i ≥ t*), i = 1, . . . , n} are i.i.d. Bernoulli random variables, we may apply Lemma A.1 for the lower tail. By (3.14), we can find lower bounds for the probability, and we may relax the inequality at x = ρ₂/(2K₄) to obtain a lower bound for S^(0)(t). The summands in S^(0)(t; K₄) are i.i.d. and uniformly bounded by K₄. Thus, we may apply the one-sided version of Lemma A.3(i). By (C3), the expectation has a lower bound, and we relax the inequality at x = (ρ₂/2 − 1/n)/K₄.
Proof of Lemma B.3.
Since ε_i > 1 implies T¹_i = ∞, the probability of observing a type-2 event conditional on Z_i(·) has an upper bound. Hence, we may derive the desired bound if we can bound ∫₀^∞ I(e^{β^o'Z_i(u)} ≥ K_e) x h¹₀(u) du away from zero with a certain x whenever e^{β^o'Z_i(t')} > K_e for some t' ∈ [0, t*].
Under (C3), there is an interval I' containing t', of length ρ₄, in which Z_i(·) has no jumps. The variation of the linear predictor over I' is bounded, so the relative risk e^{β^o'Z_i(t)} is greater than K_e exp{−K₆K₇‖β^o‖_∞ ρ₄} over I'. Hence, we get a lower bound. We finish the proof by taking a union bound over i = 1, . . . , n.
(i) By (C1) and (D1), we have ‖e^{β^o'Z_i(t)} Z_i(t)^{⊗l}‖_max ≤ (K₃/2)^l e^K. Thus, all terms involved are bounded. Moreover, e^{β^o'Z_i(t)} Z_i(t)^{⊗l} jumps only at the jumps of N^z_i(t) by (D3). Define the outer product of arrays u ∈ R^{p₁×···×p_d} and v ∈ R^{q₁×···×q_d} in the natural way. Between two consecutive jumps of N^z_i(t), the process e^{β^o'Z_i(t)} Z_i(t)^{⊗l} is Lipschitz. Hence, e^{β^o'Z_i(t)} Z_i(t)^{⊗l} satisfies the continuity condition for Lemma A.3(ii).
Finally, we extend the results to the quotients by decomposition. The denominators are bounded away from zero by Lemma B.7, choosing K₄ = e^K. (ii) First, we show that ∆_i(t) is related to the martingales M^c_i and is non-zero only after an observed type-2 event. To simplify notation, we define the indicator for non-zero ∆_i(t), υ_i(t) = r_i(t)Y_i(t)I(t > X_i) = I(δ_i ε_i > 1)I(t > X_i). Denote the Nelson-Aalen type estimator for the censoring cumulative hazard accordingly. By Lemma A.1, multiplying the rates together, we get ‖I₁‖_max = O_p(√(log(q) log(q')/n)) = o_p(1). I₂ can be expanded further. By Lemma B.7, n{Σ_{k=1}^n I(X_k ≥ t)}⁻¹ = O_p(1). The integrand in I₂ is the product of M_g(t) and an O_p(1) term. Hence, we can apply Lemma A.4(ii) to get ‖I₂‖_max = O_p(√(log(q') log(qq')/n)) = o_p(1). By (D3), we may further expand I₃ into n^{1/2} times a sum involving ∆φ(Z_j(t)) = φ(Z_j(t)) − φ(Z_j(t−)). By the assumption on h(Z) and (D3), |∇φ(Z_j(t))' dz_j(t)| and ∆φ(Z_j(t)) are both bounded by L_h K. With sup_{t∈[0,t*]} |M_g(t)| = O_p(√(log(q')/n)) and N^z_j(t*) < K_n = o(√(n/(log(p) log(n)))), we may replace the ∆̂_j(t)'s by the ∆_j(t)'s with an o_p(1) error. Each ∆_j(t)∇φ(Z_j(t))' dz_j(t) has mean zero and at most K_n + 1 jumps, and it is (L_h K + K_φ K)-Lipschitz between two consecutive jumps under (D3) and the conditions on φ(z). By applying Lemma A.3(ii), we get ∆_j(t_{ik})∆φ(Z_j(t_{ik})) = O_p(√(log(nK_n q)/n)) = O_p(√(log(nq)/n)).
Hence, ‖I₃‖_max = O_p(K_n √(log(nq) log(q')/n)) = o_p(1). This completes the proof.
Proof of Lemma B.7.
Proof of Lemma B.9.
Proof of Lemma B.10.
(i) We define the relevant quantities as follows. Now, the sample Σ is an average of i.i.d. terms with mean Σ and maximal norm bounded by K₂. We apply Lemma A.1 with a union bound; choosing x = √(log(2p²/ε)/(2n)), we have ‖Σ̂ − Σ‖_max = O_p(√(log(p)/n)). (ii) We alternatively use the following form. By Lemma B.6(iii), we have the corresponding bound, and a similar form holds for the other term. For any non-zero g ∈ R^{p−1}, let g* be its embedding into R^p defined as g*_k = g_k for k < j, g*_k = 0 for k = j, and g*_k = g_{k−1} for k > j. Then, we may establish a lower bound for the smallest eigenvalue of Σ_{−j,−j} by (D2): inf_{0≠g∈R^{p−1}} g'Σ_{−j,−j}g = inf_{0≠g∈R^{p−1}} g*'Σg* ≥ ρ‖g‖₂².
Therefore, if ξ_max = O(1), we must have that inf_j κ_j(ξ_j, O_j)² ≥ ρ/2 occurs with probability tending to one.