Cox models with time-varying covariates and partly-interval censoring: a maximum penalised likelihood approach

Time-varying covariates can be important predictors when model-based predictions are considered. A Cox model that includes time-varying covariates is usually referred to as an extended Cox model. When only right censoring is present in the observed survival times, the conventional partial likelihood method is still applicable for estimating the regression coefficients of an extended Cox model. If there are interval-censored survival times, however, the partial likelihood method is not directly available unless an imputation, such as middle-point imputation, is used to replace the left- and interval-censored data; such imputation methods are well known to cause biased estimates. This paper considers fitting extended Cox models by the maximum penalised likelihood method, allowing the observed survival times to be partly interval censored, where a penalty function is used to regularise the baseline hazard estimate. We present simulation studies to demonstrate the performance of the proposed method, and illustrate it with applications to two real datasets from medical research.


Score vector and Hessian matrix elements
Here we provide further details of the elements needed to compute β^(k+1), γ^(k+1) and θ^(k+1) in our Newton-MI algorithm, outlined in Section 3 of the main paper.
Firstly, we detail the vectors c and a, which form the diagonals of the n × n matrices C and A needed to update the parameter vector β. We have c = H * δ* + H_I * δ*_I, where H and H_I are vectors whose elements are H(y_i) and δ_i^I H(t_i^L) respectively, and δ* and δ*_I are vectors whose elements are

δ*_i = δ_i + δ_i^R − δ_i^L S(y_i)/(1 − S(y_i)) − δ_i^I S(t_i^R)/(S(t_i^L) − S(t_i^R))

and δ*_{Ii} = δ_i^I S(t_i^L)/(S(t_i^L) − S(t_i^R)) respectively. Here, '*' denotes element-wise multiplication throughout this paper. The elements a_i of a are defined in terms of c_i, the i-th element of the vector c. Clearly, if a sample contains only event times and right-censoring times (i.e. δ_i^L = δ_i^I = 0 for all i), then the matrices A and C are identical.

Next, we detail the vector E and the block-diagonal matrices B and E required for updating the estimate of the parameter vector γ. This computation requires some additional notation. Where there are interval-censored observations in a data set, it is necessary to construct a long-format data frame containing rows that correspond to interval-censored observations truncated at t_i^L. Consider, for individual i, a vector of length n_i + 1 that gives all the values t_{ia} for a = 1, ..., n_i as well as y_i, namely [t_{i1}, t_{i2}, ..., t_{i n_i}, y_i]^T. Say that the number of t_{ia} < t_i^L is equal to c_i, and let τ_{c_i} be a vector of length c_i with values [t_{i1}, ..., t_{i c_i}]^T. The truncated list of times can then be given as a vector τ_i of length n_i + 1, constructed from τ_{c_i} together with 0_{n_i − c_i}, a vector of zeroes of length n_i − c_i. Note that if individual i is not interval censored, then c_i = 0 and τ_i = 0_{n_i}. It is also convenient to establish notation for the long-format versions of the event and censoring-type indicators (δ_i, δ_i^R, δ_i^L, δ_i^I) for individual i; these are represented as (∆_i, ∆_i^R, ∆_i^L, ∆_i^I), where, for example, ∆_i is an n_i-vector equal to δ_i 1_{n_i}.
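As a concrete illustration of the vector c above, the following minimal NumPy sketch computes δ*, δ*_I and c = H * δ* + H_I * δ*_I for a hypothetical sample of four subjects (one event, one right-, one left- and one interval-censored observation). All numerical values for S(·) and H(·) are made up for illustration only.

```python
import numpy as np

# Hypothetical indicators: event (delta), right- (dR), left- (dL) and
# interval-censoring (dI), one per subject.
delta = np.array([1.0, 0.0, 0.0, 0.0])
dR    = np.array([0.0, 1.0, 0.0, 0.0])
dL    = np.array([0.0, 0.0, 1.0, 0.0])
dI    = np.array([0.0, 0.0, 0.0, 1.0])

# Hypothetical survival-function values at the relevant times.
S_y  = np.array([0.70, 0.50, 0.40, 0.60])   # S(y_i)
S_tL = np.array([0.90, 0.90, 0.90, 0.80])   # S(t_i^L); only used when dI = 1
S_tR = np.array([0.40, 0.40, 0.40, 0.30])   # S(t_i^R); only used when dI = 1

# delta*_i = delta_i + dR_i - dL_i S(y_i)/(1 - S(y_i))
#                    - dI_i S(t_i^R)/(S(t_i^L) - S(t_i^R))
d_star  = delta + dR - dL * S_y / (1.0 - S_y) - dI * S_tR / (S_tL - S_tR)
# delta*_{Ii} = dI_i S(t_i^L)/(S(t_i^L) - S(t_i^R))
d_starI = dI * S_tL / (S_tL - S_tR)

# Hypothetical cumulative-hazard values H(y_i) and dI_i * H(t_i^L).
H  = np.array([0.36, 0.69, 0.92, 0.51])
HI = dI * np.array([0.0, 0.0, 0.0, 0.22])

# c = H * delta* + H_I * delta*_I ('*' element-wise); c is the diagonal of C.
c = H * d_star + HI * d_starI
C = np.diag(c)
```

For the event and right-censored subjects δ*_i = 1, so c_i reduces to H(y_i), while the left- and interval-censored subjects contribute through the survival-function ratios.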
We can then denote the N-vector ∆ = [∆_1^T, ..., ∆_n^T]^T for the whole sample, and similarly obtain ∆_R, ∆_L and ∆_I for right-, left- and interval-censoring respectively. Now, we can express the vector E = ζ * ∆, with ζ being an N-vector with 1's at indices n_1, (n_1 + n_2), ..., N (i.e. 1's at the indices corresponding to the last record for each individual i) and 0's elsewhere, and ∆ as just established. Additionally, we have E, an N-vector given by E = (E_1, ..., E_n). Each vector E_i has dimension n_i × 1 and is expressed in terms of ∆*_i and ∆*_{Ii}, which are both n_i-vectors giving the long-format analogues of δ*_i and δ*_{Ii}. Recall that a = 0, ..., n_i − 1 and that the τ_{ia} are the change points of z_i(t) truncated at t_i^L, as defined above. Then, the matrix B is an N × N block-diagonal matrix given by B = diag(B_1, ..., B_n), and each B_i is a diagonal matrix with its diagonal given by an n_i-vector.

Finally, we detail the necessary computations for updating the parameter vector θ. Let p, p* and p*_I all be n × m matrices whose (i, u)-th elements are, respectively, ψ_u(y_i), Ψ*_u(y_i) and δ_i^I Ψ_u(t_i^L). Let e be an n-vector of the e^{x_i^T β}, and let f be an n × n diagonal matrix with (i, i)-th element h_0(y_i). Additionally, let J be an m-vector equal to the first derivative of the penalty function, and let J_pos contain the positive elements of J and J_neg the negative elements, so that J = J_pos − J_neg. We can compute two m-vectors s_pos and s_neg, with s_pos = p^T f δ + p*^T δ*_pos e − J_neg and s_neg = p*^T δ*_neg e + p*_I^T δ*_I e, and then compute s = s_pos − s_neg. Finally, we can compute g_θ, an m × m diagonal matrix with (u, u)-th element θ_u/(s_neg,u + ξ), where ξ is a small constant included to avoid a zero denominator in the computation of g_θ.
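The diagonal scaling g_θ suggests a multiplicative-iterative (MI) step for θ. The exact update formula appears in the main paper; the sketch below assumes the common MI form θ^(k+1) = θ^(k) + ω g_θ (s_pos − s_neg), which for a full step (ω = 1) reduces element-wise to θ^(k) (s_pos + ξ)/(s_neg + ξ) and therefore preserves non-negativity of θ. All numerical values are hypothetical.

```python
import numpy as np

def mi_theta_step(theta, s_pos, s_neg, xi=1e-8, omega=1.0):
    """One assumed MI update of the non-negative coefficients theta.

    s_pos and s_neg are the positive and negative parts of the
    penalised score s = s_pos - s_neg; g_theta is diagonal with
    entries theta_u / (s_neg_u + xi); omega is a line-search step.
    """
    g = theta / (s_neg + xi)
    return theta + omega * g * (s_pos - s_neg)

# Hypothetical score parts; with omega = 1 the update equals
# theta * (s_pos + xi) / (s_neg + xi), so theta stays >= 0.
theta = np.array([0.5, 1.2, 0.0, 0.3])
s_pos = np.array([2.0, 1.0, 0.4, 0.0])
s_neg = np.array([1.0, 2.0, 0.8, 0.6])
theta_new = mi_theta_step(theta, s_pos, s_neg)
```

A coefficient already at zero stays at zero (its scaling g_θ,u is zero), which is how active constraints at θ_u = 0 arise in practice.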

Asymptotic consistency and normality proofs
In this section we state asymptotic consistency and normality results that allow large-sample inferences to be made on the regression parameters and survival quantities of the proposed model. Importantly, these results mean that our method does not require computationally intensive procedures such as bootstrapping for inference.

Asymptotic consistency
For the asymptotic consistency result, we denote the true values of the parameters β, γ and h_0(t) by β_0, γ_0 and h_{00}(t). Let a and b be the minimum and maximum of all the observed survival times respectively, including interval-censoring times but excluding 0 and ∞, and let C^r[a, b] be the set of functions with r continuous derivatives over [a, b]. The parameter space for β can be given by B = {β : |β_k| ≤ C_1 < ∞, ∀k}, and the parameter space for γ by G = {γ : |γ_j| ≤ C_2 < ∞, ∀j}. The parameter space for h_0(t) can be given by the non-negative functions in C^r[a, b]. Before defining the MPL estimator of τ, it is necessary to account for the fact that this method estimates an approximation of h_0(t). For convenience, this approximation can be denoted h̃_0(t) = Σ_{u=1}^m θ_u ψ_u(t), and the parameter space for h̃_0(t) can be given by the functions of this form with θ_u ≥ 0 for all u. The MPL estimator of τ is then τ̂_n = (β̂, γ̂, ĥ_0(t)). We require the following conditions:

A1. The matrices X and Z are bounded, and both E(XX^T) and E(ZZ^T) are non-singular.
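To make the approximation h̃_0(t) = Σ_u θ_u ψ_u(t) concrete, here is a small sketch. A Gaussian bump basis stands in for whichever non-negative spline basis ψ_u the paper actually uses, and the knots, width and θ values are hypothetical; with θ_u ≥ 0 and ψ_u ≥ 0 the approximation is automatically a valid (non-negative) hazard.

```python
import numpy as np

# Hypothetical Gaussian basis psi_u centred at m = 4 assumed knots;
# the paper's actual spline basis would be substituted here.
knots = np.array([0.5, 1.0, 1.5, 2.0])
width = 0.4

def psi(t):
    """Return the m non-negative basis values psi_u(t) for each t."""
    t = np.atleast_1d(t)
    return np.exp(-0.5 * ((t[:, None] - knots) / width) ** 2)

def h0_tilde(t, theta):
    """Basis approximation h~0(t) = sum_u theta_u psi_u(t)."""
    return psi(t) @ theta

theta = np.array([0.2, 0.5, 0.3, 0.1])       # non-negative coefficients
vals = h0_tilde(np.array([0.75, 1.25]), theta)
```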
A2. The penalty function J(η) is bounded over Γ and Γ n .
A4. The knots and basis functions are selected such that, for any h_0 in the parameter space, the approximation h̃_0 can be made arbitrarily accurate as m → ∞.

Theorem 1 demonstrates asymptotic consistency of τ̂_n when the number of basis functions m → ∞ with m/n → 0 as n → ∞, and the scaled smoothing value μ_n = λ/n → 0 as n → ∞. For proofs we refer the reader to ?.

Asymptotic normality
For the asymptotic normality results, we note that it is now necessary to restrict m to be finite, as in ? and ?. The fixed m is not pre-determined, however, as it depends on the sample size n; we use m = n_0^{1/3} as a rough guide, where n_0 is the number of observations that are not right censored. In addition, the asymptotic normality results require a method of accounting for potentially active constraints in the estimation of θ ≥ 0. Active constraints are particularly likely to occur when more knots are specified than strictly necessary, since the smoothing penalty will push the unnecessary parameters to 0. Ignoring active constraints may lead to negative variance estimates for θ.
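The rough guide m = n_0^{1/3} can be sketched as follows, where n_0 counts the subjects that are not right censored (events, left- or interval-censored); the indicator arrays and the lower bound min_m are hypothetical inputs:

```python
import numpy as np

def choose_m(delta, dL, dI, min_m=2):
    """Rough guide m = n0^(1/3): n0 is the number of subjects that
    are not right censored (events, left- or interval-censored).
    min_m is a hypothetical lower bound on the basis size."""
    n0 = int(np.sum((delta + dL + dI) > 0))
    return max(min_m, int(round(n0 ** (1.0 / 3.0))))

# Hypothetical sample of 100 subjects: 64 events, 36 right censored,
# so n0 = 64 and m = round(64^(1/3)) = 4.
delta = np.r_[np.ones(64), np.zeros(36)]
m = choose_m(delta, np.zeros(100), np.zeros(100))
```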
To address the possible presence of active constraints, we adopt the following method. Recall that η = [β^T, γ^T, θ^T]^T denotes the parameter vector, of length p + q + m. Let the MPL estimate of η be η̂, and let η_0 be the true value of the parameter vector. Without loss of generality, we assume that the first r values of θ are 0 and hence actively constrained.
Define U as a block matrix composed of a matrix of zeroes, 0, in the rows corresponding to the r actively constrained parameters and an identity matrix I in the remaining rows, so that the condition U^T U = I_{(m−r+p+q)×(m−r+p+q)} is satisfied. We require the conditions:

B1. The distributions of x_i and z_i are independent of η.
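A sketch of constructing U: equivalently to the block form described above, one can take the identity matrix of dimension p + q + m and delete the columns corresponding to the actively constrained coordinates. The sizes and index positions below are hypothetical.

```python
import numpy as np

def build_U(dim, active_idx):
    """Identity of size dim with the columns for the actively
    constrained coordinates removed, so that U^T U = I_(dim - r)."""
    keep = [j for j in range(dim) if j not in set(active_idx)]
    return np.eye(dim)[:, keep]

# Hypothetical sizes p = 2, q = 1, m = 5, with r = 2 active
# constraints on theta (occupying positions 3 and 4 of eta here).
U = build_U(dim=2 + 1 + 5, active_idx=[3, 4])
```

By construction, U^T F(η) U simply picks out the rows and columns of F(η) for the unconstrained parameters.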
B2. The limit lim_{n→∞} [n^{-1} l(η)] exists and has a unique maximum at η_0 ∈ Ω, where Ω is the parameter space for η and is a compact subspace of R^{p+q+m}. That is, in the limit of infinite sample size, the true parameters are recovered exactly by maximising the likelihood.
B3. l(η) has a finite upper bound and is twice continuously differentiable in a neighbourhood of η_0.

B4. The penalty function J(η) is twice continuously differentiable on Ω, and these derivatives are bounded.
B5. The matrix U T F(η)U is invertible in a neighbourhood of η 0 .
Theorem 2 states the asymptotic normality results under these conditions; we again refer to ? for detailed proofs.
Theorem 2. Let μ_n = λ/n. Assume that μ_n = o(n^{1/2}), that the first r constraints are active in the MPL estimate of θ, that U is defined as above, and that B1-B5 hold. Then, when n → ∞, η̂ − η_0 is asymptotically normal.

To implement the results of Theorem 2, we require a method for identifying active constraints in practice; the method used here closely follows that proposed by ?. Active constraints can be identified by inspecting both the value of θ̂_u and the corresponding gradient for each u. After the Newton-MI algorithm has converged, some θ̂_u may be exactly zero with negative gradients, and these are clearly subject to an active constraint. There may also be some θ̂_u that are very close to, but not exactly, zero; for these, a negative gradient value indicates that they too are subject to an active constraint. In practice, an active constraint is declared for a given u when θ̂_u < 10^{-3} and the corresponding gradient is less than −ε, where ε is a positive threshold value such as 10^{-2}. Once the indices associated with active constraints are identified, obtaining the matrix F̂(η_0)^{-1} is a very straightforward computation: the matrix U^T F(η) U is obtained by removing the rows and columns of F(η) associated with the active constraints, the result is inverted, and the inverse is then padded with zeroes in the deleted rows and columns to give F̂(η_0)^{-1}.

To make use of these asymptotic results for inference on finite samples, it is necessary to approximate the distribution of η̂ when n is large. Doing so also incorporates non-zero values of the smoothing parameter λ into the inference on the parameter estimates. The necessary results are presented below in Corollary 1.
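The active-constraint detection and the padded inverse described above can be sketched as follows. The thresholds 10^{-3} and ε = 10^{-2} follow the text, while the matrix values are hypothetical; note that indices of constrained θ_u must be offset by p + q before indexing into the full matrix F(η).

```python
import numpy as np

def active_constraints(theta_hat, grad_theta, tol=1e-3, eps=1e-2):
    """Indices u with theta_u below tol and gradient below -eps,
    flagged as actively constrained (thresholds as in the text)."""
    return np.where((theta_hat < tol) & (grad_theta < -eps))[0]

def padded_inverse(F, active_idx):
    """Delete the rows/columns of F for the active constraints,
    invert the reduced matrix, then pad zeroes back in."""
    keep = np.setdiff1d(np.arange(F.shape[0]), active_idx)
    F_red_inv = np.linalg.inv(F[np.ix_(keep, keep)])
    out = np.zeros_like(F, dtype=float)
    out[np.ix_(keep, keep)] = F_red_inv
    return out

# Hypothetical 3x3 F with the last coordinate actively constrained.
Finv = padded_inverse(np.diag([2.0, 4.0, 5.0]), np.array([2]))
```

The padded zero rows and columns encode the degenerate (point-mass) distribution of the actively constrained coefficients.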
Corollary 1. Assume that the smoothing parameter λ grows more slowly than n, so that μ_n = λ/n remains small. Then, when n is large, the distribution of the MPL estimate η̂ − η_0 can be approximated by a multivariate normal distribution having mean zero and a covariance matrix based on the penalised version of F(η) described above. These results allow inferences to be made not only on both sets of regression parameters but also on quantities associated with the baseline hazard function.