Analysis of survival data with cure fraction and variable selection: A pseudo-observations approach

In biomedical studies, survival data with a cure fraction (the proportion of subjects cured of disease) are commonly encountered. The mixture cure and bounded cumulative hazard models are two main types of cure fraction models when analyzing survival data with long-term survivors. In this article, in the framework of the Cox proportional hazards mixture cure model and bounded cumulative hazard model, we propose several estimators utilizing pseudo-observations to assess the effects of covariates on the cure rate and the risk of having the event of interest for survival data with a cure fraction. A variable selection procedure is also presented based on the pseudo-observations using penalized generalized estimating equations for proportional hazards mixture cure and bounded cumulative hazard models. Extensive simulation studies are conducted to examine the proposed methods. The proposed technique is demonstrated through applications to a melanoma study and a dental data set with high-dimensional covariates.


Introduction
In time-to-event analyses, it is usually assumed that all subjects will eventually experience the event of interest if the follow-up period is sufficiently long. However, in many research fields, including biomedical, genetic, and social studies, some subjects may never experience the event of interest in their lifetime. These subjects are referred to as cured or nonsusceptible subjects. For example, in a melanoma progression study, 1 the cured patients never experience a melanoma relapse after the initial treatment. On the other hand, in a dental study, 2 the cured subjects are the teeth that underwent proper periodontal treatments and can last a lifetime. In general, it is difficult to identify the cured subjects, but their presence is signaled by a leveling of the Kaplan-Meier (KM) survival curve at the end of the follow-up, for example, Figure 1(a). Standard models do not account for the cure fraction and could lead to biased estimates of the survival of susceptible subjects. 3 Even when the cure fraction is accounted for, the dental study poses the additional challenge of high dimensionality: more than 50 predictors relevant to decision making in periodontal treatments are potential risk factors affecting the survival of teeth. Motivated by these emerging and challenging scientific questions, we propose a new estimating procedure using a pseudo-observations approach for statistical inference on the parameters of interest and extend the proposed methods to regularized regression.
Two types of cure models have been popular in the literature, with most emphasis on the mixture cure (MC) model. 4 The MC model assumes that the population consists of two types of patients, a cured group in which the patients are not at risk of experiencing the event and a susceptible group in which they eventually experience the event. The MC model consists of two components, incidence and latency. The former indicates whether the subject is susceptible, and the latter represents the time to the event when the patient is in the susceptible group. For the incidence component, a logistic regression model is often used to describe the covariate effects on the cure fraction. In contrast, parametric and semiparametric models have been proposed for the latency component to describe the underlying failure time distribution of susceptible subjects. Among those, Weibull model, 4 generalized F model, 5 Cox proportional hazards (PH) model, 6 and accelerated failure time model 7 have been studied.
The second type of cure model is the bounded cumulative hazard (BCH) model, also known as the promotion time cure model, which models the survival times through an improper survival function; see, for example, Tsodikov. 8 The idea was first introduced by Yakovlev et al. 9 from biological considerations, in which cancer recurrence is promoted by the number of carcinogenic cells and disease progression. Thus, the parameters specified in the BCH model bear exact biological meaning. Treating the carcinogenic cell counts as latent nuisance quantities, the BCH model is suitable for any type of survival data as long as it is reasonable to assume the data have a cure fraction. 10 Incorporating covariates into the BCH model modifies the cure fraction and introduces a long-term covariate effect on survival. The BCH model has a PH structure through the long-term effect parameter. Tsodikov et al. 11 further incorporated covariates into the baseline survival function through another PH structure, introducing a short-term covariate effect on survival. The two-component BCH model of Tsodikov et al. 11 is termed the PHPH model. The MC model and the PHPH model comprise different covariate effects, each providing unique clinical interpretations.
Existing estimating procedures for the MC model and the PHPH model usually involve updating the parameters via expectation-maximization (EM) algorithms to account for latent variables, for example, the cure fraction, in the likelihood. However, these approaches can be computationally expensive in high-dimensional data or when the bootstrap method is used for variance estimation. Thus, estimating procedures, such as the pseudo-observations approach, that do not rely on the EM algorithm are more appealing for practical use. The idea of the pseudo-observations approach is to create pseudo values for the quantities of interest at the individual level using the analogy of leave-one-out cross-validation. These pseudo values are then treated as complete data to which standard methods can be conveniently applied. Specifically, for subject i = 1, . . . , n, let T_i be independent and identically distributed random variables and X_i be a vector of covariates. The interest lies in modeling E[f(T_i) | X_i], where f is a pre-specified transformation function of T_i. Due to censoring, not all f(T_i) are observed. However, the observed or unobserved f(T_i) can be replaced by its pseudo-observation ρ̂_i = nρ̂ − (n − 1)ρ̂_{−i}, where ρ̂ is a consistent and (approximately) unbiased estimator of the expectation ϱ = E[f(T)] and ρ̂_{−i} is the estimator of ϱ using the remaining n − 1 subjects, leaving subject i out of the sample. The pseudo-observations approach was first proposed by Andersen et al. 12 to model the transition probabilities in multi-state models. Since then, the pseudo-observations approach has been applied to many settings in survival analysis, including survival estimates, 13 the restricted mean survival time, 14 the cumulative incidence function, 15 the relative survival function, 16 the illness-death model with interval-censored data, 17 and causal inference for recurrent event data. 18 Large sample properties have also been thoroughly investigated. 19-21 However, the pseudo-observations approach has not been applied to the analysis of survival data with a cure fraction.
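The leave-one-out construction ρ̂_i = nρ̂ − (n − 1)ρ̂_{−i} can be sketched in a few lines. Below is a minimal Python illustration (the methods in this article are implemented in R; Python is used here purely for exposition), where `estimator` stands in for any consistent estimator of ϱ = E[f(T)]:

```python
def pseudo_observations(data, estimator):
    """Jackknife-style pseudo-observations: rho_i = n*rho_hat - (n-1)*rho_hat_{-i}."""
    n = len(data)
    rho_full = estimator(data)
    return [n * rho_full - (n - 1) * estimator(data[:i] + data[i + 1:])
            for i in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

# Sanity check: with complete (uncensored) data and the sample mean as the
# estimator, the pseudo-observations reduce to the observations themselves.
print(pseudo_observations([1.0, 2.0, 6.0], mean))  # -> [1.0, 2.0, 6.0]
```

With censored data, `estimator` would instead be a censoring-adjusted estimator such as a Kaplan-Meier functional, and the pseudo values would no longer coincide with the raw observations.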
Regularization and variable selection are commonly used in high-dimensional data analysis, but they have been less studied for cure models. Regularized procedures minimize an objective function that includes a penalty function to reflect sparsity. Popular penalty functions include the least absolute shrinkage and selection operator (LASSO), 22 adaptive LASSO (ALASSO), 23 and smoothly clipped absolute deviation (SCAD). 24 For Cox MC models, Liu et al. 25 used LASSO and SCAD penalties to select variables based on penalized likelihood functions. Masud et al. 26 performed variable selection based on the ALASSO penalty while considering linear and nonlinear effects in both components. Masud et al. 27 further applied the ALASSO penalty to the Cox MC and BCH models. In both works, an EM algorithm was adopted to estimate the parameters, and a bootstrap resampling procedure was used to obtain standard error estimates. These approaches can be computationally intensive for high-dimensional data. The penalized BCH model of Masud et al. 27 has limited application because it requires additional information on the latent carcinogenic cell counts, making their approach inapplicable to survival data without similar biological interpretations. Moreover, the short-term covariate effect is not considered in their approach, limiting the understanding of the covariate impact on the timing of disease occurrence.
In this article, we develop new estimating procedures based on the pseudo-observations approach for both the MC and the BCH models. We further extend the proposed method by adopting the penalized generalized estimating equations (PGEE) approach 28 to perform variable selection. The proposed work closes the gap on variable selection in cure rate models with pseudo-observations techniques. The proposed approaches are attractive in several aspects. First, pseudo-observations can be straightforwardly used as complete outcomes for the generalized linear model (GLM) without indication of censoring. Second, the proposed estimating procedures are computationally efficient and faster in running time than standard approaches that adopt the EM algorithm for estimation and bootstrapping for standard errors, as the unknown regression parameters are estimated via the generalized estimating equations (GEE) approach with corresponding variance estimates obtained by sandwich estimators. Third, unlike existing methods in which the estimation and variable selection of the regression parameters in the incidence and latency components of the MC model 6,27 or in the short-term and long-term effects of the BCH model 11,27 are performed simultaneously within one model, our proposed methods using pseudo-observations can perform the estimation and variable selection separately for each component of the MC model or each effect of the BCH model. Specifically, once the pseudo-observations for the quantity of interest in each component of the MC model or each effect of the BCH model are generated for each subject, they can be modeled with standard methods such as GLMs, and the GEE and PGEE estimating methods can be applied for parameter estimation and variable selection. Finally, the proposed regression estimators can be easily implemented via standard statistical software.
The remainder of the article is organized as follows. The MC model and the BCH model are reviewed in Section 2. The construction of pseudo-observations is described in Section 3. Inference procedure, model diagnosis, and variable selection are presented in Section 4. Comprehensive simulation results are reported in Section 5. The analysis of two real datasets is provided in Section 6. Concluding remarks are given in Section 7. Asymptotic properties and additional simulation results are provided in the online Supplemental Materials.

The cure models
Let Y denote the cure status of a subject, where Y = 1 if the subject eventually experiences an event (uncured, susceptible), and Y = 0 if the subject is a long-term survivor (cured, non-susceptible). Let T = YT* + (1 − Y) × ∞ be the survival time, where T* < ∞ is the failure time if the subject is susceptible. In the presence of right censoring, we assume the observed data consist of n independent replicates (T̃_i, δ_i, X_i, Z_i), i = 1, . . . , n, which are copies of (T̃, δ, X, Z), where T̃ = min{T, C}, δ = I(T ≤ C), C is the censoring time, and X and Z are vectors of covariates with dimensions p and q, respectively. We allow X and Z to be completely distinct, overlapping, or identical. When δ = 1, the subject experienced an event and Y = 1. However, when δ = 0, the cure status Y is not observed.

MC model
The MC model expresses the population survival function as

S(t) = (1 − π) + πS_u(t), (1)

where π = P(Y = 1) is the uncured rate and S_u(t) is the conditional survival function of T* given Y = 1. The incidence component π is assumed to follow a logistic regression model

π(X) = exp(α_0 + α^⊤X)/{1 + exp(α_0 + α^⊤X)}, (2)

where α_0 is a scalar and α is a p-column vector. For the latency component, we model the conditional survival function via the Cox proportional hazards model

λ(t | Z) = λ_0(t) exp(β^⊤Z), (3)

where β is a q-column vector of regression coefficients, and λ_0(t) is the unspecified baseline hazard function with cumulative counterpart Λ_0(t) = ∫_0^t λ_0(u)du. Under models (2) and (3), the MC model is called the PHMC model. 6

BCH model
Suppose Λ(t) is the cumulative hazard function of T* such that Λ(∞) = θ > 0. Under the BCH model, the population survival function can be written as

S(t) = exp{−θF(t)}, (4)

where F(t) = Λ(t)/θ is a proper cumulative distribution function of a nonnegative random variable with F(0) = 0 and F(∞) = Λ(∞)/θ = 1. As t → ∞, one has lim_{t→∞} S(t) = exp(−θ), which is the cure rate. Covariate effects can be imposed through the parameter θ with θ(X) ≡ θ(X, γ_0, γ), where θ(X, γ_0, γ) is a known function that relates a p × 1 vector of regression coefficients γ to X with an intercept term γ_0. This modeling strategy leads to the improper PH model, 29 and the covariates have a long-term effect because θ describes the long-term survival probability. 8 A common choice of θ(·) is the exponential parameterization θ(X) = exp(γ_0 + γ^⊤X). Tsodikov 8 extends the improper PH model by adding a short-term effect, incorporating covariates into F(t) or the survival function F̄(t) = 1 − F(t).
Specifically, the PHPH model of Tsodikov et al. 11 has the form

S(t | X, Z) = exp[−θ(X){1 − F̄(t)^{η(Z)}}], (5)

where η(Z) = exp(ϕ^⊤Z) and ϕ is a q-column vector of regression coefficients. To avoid overparameterization, we assume the coefficients ϕ do not contain an intercept term, as suggested in Tsodikov et al. 11 When the cure fraction is the only parameter of interest, it makes no difference whether model formulation (1) or (4) is chosen to estimate the cure rate nonparametrically. However, the two models differ once additional model assumptions are imposed on the cure rate and the latency distribution of T* in the MC model.

Pseudo-observations for MC model
A common approach for constructing pseudo-observations is to generate them from a nonparametric estimator of the parameter of interest. The MC model has two parameters to be estimated, the uncured rate π and the conditional survival function S_u(t). There are two candidate estimators for the uncured rate π. The first was proposed by Maller and Zhou, 30 in which the cure rate 1 − π is estimated by Ŝ_KM(t_max), where Ŝ_KM(t) is the KM estimator 31 and t_max is the maximum of the observed event times. This implies that π can be estimated by π̂_KM = 1 − Ŝ_KM(t_max). Following the construction of pseudo-observations in Andersen et al., 12 one can define the pseudo-observations for subject i by

π̂^i_KM = nπ̂_KM − (n − 1)π̂^{−i}_KM, (6)

where π̂^{−i}_KM is the estimator of π using the remaining n − 1 subjects, leaving subject i out of the sample. The second estimator is based on the estimation of θ in Tsodikov 32 through the connection between models (1) and (4). To be specific, let t_(1) < t_(2) < . . . < t_(D) be the unique observed failure times, and let t_(0) = 0 and t_(D+1) = ∞. Let M_j = Σ_{i=1}^n I(T̃_i = t_(j), δ_i = 1) and N_j = Σ_{i=1}^n I(t_(j) ≤ T̃_i < t_(j+1), δ_i = 0) be the number of failures at time t_(j) and the number of censored times in the interval [t_(j), t_(j+1)), respectively. Under model (4), θ can be estimated by θ̂_NP = Σ_{j=1}^D θ̂_j, where the components θ̂_j, j = 1, . . . , D − 1, are obtained from the estimating procedure described in Web Appendix A, and θ̂_D = − log{N_D/(N_D + M_D)}. Consequently, F(t) can be estimated by F̂_NP(t) = Σ_{j: t_(j) ≤ t} Ĵ_j, where Ĵ_j = θ̂_j/θ̂_NP is the estimated jump size at time t_(j). Instead of using the Lagrange multiplier method, 32 we use a change of variables under the constraint Σ_{k=1}^D J_k = 1 to obtain θ̂_NP and F̂_NP(t). The estimating procedure is summarized in Web Appendix A. Since models (1) and (4) have the same cure rate, one can estimate π by π̂_NP = 1 − exp(−θ̂_NP) and create the pseudo-observations for π by

π̂^i_NP = nπ̂_NP − (n − 1)π̂^{−i}_NP, (7)

where π̂^{−i}_NP is the estimator of π obtained when leaving subject i out of the sample.
Web Appendix B shows the behavior of the pseudo-observations from (6) and (7) based on simulated data. One can see that these pseudo-observations are not necessarily within the range [0, 1].
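As an illustration of the KM-based construction (6), the following Python sketch computes π̂_KM = 1 − Ŝ_KM(t_max) with a minimal hand-rolled Kaplan-Meier estimator and forms the leave-one-out pseudo-observations. The function names are ours, and ties are handled naively (events processed before censorings); a production analysis would use an established survival library instead.

```python
def km_survival(times, events, t):
    """Naive Kaplan-Meier estimate of S(t); events: 1 = event, 0 = censored.
    At tied times, events are processed before censorings."""
    pairs = sorted(zip(times, events), key=lambda p: (p[0], -p[1]))
    s = 1.0
    for j, (tj, dj) in enumerate(pairs):
        if tj > t:
            break
        if dj == 1:
            s *= 1.0 - 1.0 / (len(pairs) - j)  # len(pairs)-j subjects at risk
    return s

def cure_rate_pseudo(times, events):
    """Pseudo-observations (6) for the uncured rate pi = 1 - S_KM(t_max)."""
    n = len(times)
    t_max = max(t for t, d in zip(times, events) if d == 1)
    pi_hat = 1.0 - km_survival(times, events, t_max)
    pseudo = []
    for i in range(n):
        ts, ds = times[:i] + times[i + 1:], events[:i] + events[i + 1:]
        t_max_i = max((t for t, d in zip(ts, ds) if d == 1), default=t_max)
        pseudo.append(n * pi_hat - (n - 1) * (1.0 - km_survival(ts, ds, t_max_i)))
    return pseudo

# Three subjects: events at times 1 and 2, one subject censored at time 3.
print([round(v, 6) for v in cure_rate_pseudo([1.0, 2.0, 3.0], [1, 1, 0])])
# -> [1.0, 1.0, 0.0]
```

In this toy example the event subjects receive pseudo values near 1 and the censored subject near 0; with heavier censoring, values outside [0, 1] can occur, as noted above.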
To create the pseudo-observations for S_u(t), one can express S_u(t) = π^{−1}{S(t) − (1 − π)} from model (1), which implies that S_u(t) can be estimated by Ŝ_u(t) = π̂^{−1}{Ŝ(t) − (1 − π̂)}. The pseudo-observations for subject i are then defined by

Ŝ^i_u(t) = nŜ_u(t) − (n − 1)Ŝ^{−i}_u(t), (8)

where Ŝ^{−i}_u(t) is the estimator of S_u(t) when leaving subject i out of the sample.

Pseudo-observations for BCH model
Under model (4), we propose two approaches to create pseudo-observations for θ and F(t), respectively. Since the cure rate lim_{t→∞} S(t) = exp(−θ) can be nonparametrically estimated by Ŝ_KM(t_max), θ can be estimated by θ̂_KM = − log Ŝ_KM(t_max).
Moreover, as mentioned in Section 3.1, θ can also be estimated by θ̂_NP. Thus, the pseudo-observations for θ can be created by one of the following two approaches:

θ̂^i_KM = nθ̂_KM − (n − 1)θ̂^{−i}_KM, (9)

θ̂^i_NP = nθ̂_NP − (n − 1)θ̂^{−i}_NP, (10)

where θ̂^{−i}_KM and θ̂^{−i}_NP are the corresponding estimators of θ when leaving subject i out of the sample. Based on Tsodikov, 8 F(t) can be consistently estimated by F̂_KM(t) = log{Ŝ_KM(t)}/ log{Ŝ_KM(t_max)}. As mentioned in Section 3.1, F(t) can also be estimated by F̂_NP(t) = Σ_{j: t_(j) ≤ t} Ĵ_j. Thus, the pseudo-observations for F(t) can be created by one of the following two approaches:

F̂^i_KM(t) = nF̂_KM(t) − (n − 1)F̂^{−i}_KM(t), (11)

F̂^i_NP(t) = nF̂_NP(t) − (n − 1)F̂^{−i}_NP(t), (12)

where F̂^{−i}_KM(t) and F̂^{−i}_NP(t) are the corresponding estimators of F(t) when leaving subject i out of the sample.
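The KM-based quantities θ̂_KM and F̂_KM(t) are simple transforms of the KM curve. A small Python sketch follows; the step function `S_hat` is a toy stand-in for an estimated KM curve that plateaus at 0.3 (implying a 30% cure rate), not real data.

```python
import math

def theta_km(S, t_max):
    """theta_hat_KM = -log S_KM(t_max); S is any estimated survival function."""
    return -math.log(S(t_max))

def F_km(S, t, t_max):
    """F_hat_KM(t) = log S_KM(t) / log S_KM(t_max), Tsodikov's KM-based estimator."""
    return math.log(S(t)) / math.log(S(t_max))

# Toy step function standing in for a KM curve that levels off at 0.3.
S_hat = lambda t: 0.3 if t >= 2.0 else (0.6 if t >= 1.0 else 1.0)

theta = theta_km(S_hat, 2.0)
print(round(math.exp(-theta), 3))       # -> 0.3  (recovers the cure rate)
print(round(F_km(S_hat, 2.0, 2.0), 3))  # -> 1.0  (F reaches 1 at t_max)
```

Leave-one-out versions of these two transforms, applied to the KM curve recomputed without subject i, give the pseudo-observations (9) and (11).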

Statistical inference
The statistical inference for the pseudo-observations approach is based on the asymptotic unbiasedness of the pseudo-observations for the parameter of interest. 33 Following Jacobsen and Martinussen 20 and Overgaard et al., 21 we present proofs of the asymptotic unbiasedness of the proposed pseudo-observations in Web Appendix C of the online Supplemental Materials.

Estimation of parameters under MC model
Based on the pseudo-observations in Section 3.1, the parameters (α_0, α^⊤) in the incidence component and β in the latency component of the PHMC model with (2) and (3) can be estimated separately. To estimate (α_0, α^⊤), we consider the following GLM

g_1(E[π̂^i | X_i]) = α_0 + α^⊤X_i, (13)

where g_1(x) = log{x/(1 − x)} is the logit link function for a binary variable. The parameters (α_0, α) can be estimated with the GEE approach 34 using the pseudo-observations π̂^i, i = 1, . . . , n, by solving the estimating equations

Σ_{i=1}^n {∂μ_{1,i}/∂(α_0, α^⊤)}^⊤ V_{1,i}^{−1} (π̂^i − μ_{1,i}) = 0, (14)

where μ_{1,i} = g_1^{−1}(α_0 + α^⊤X_i), π̂^i is the pseudo-observation for π, and V_{1,i} is the working variance. Let (α̂^PO_{0,KM}, α̂^PO_KM) and (α̂^PO_{0,NP}, α̂^PO_NP) be the estimators from (14) when π̂^i is replaced by π̂^i_KM and π̂^i_NP, respectively. To estimate β, the pseudo-observations for S_u(t) are evaluated at several time points and used as responses in a GLM for the covariate effects. Specifically, let t = {t_1, . . . , t_H} be a set of distinct times between 0 and the maximum of the observed event times, and let Ŝ^i_u(t_h) be the pseudo-observation (8) for subject i at time t_h, h = 1, . . . , H. We assume the GLM

g_2(E[Ŝ^i_u(t_h) | Z_i]) = ξ_{t_h} + β^⊤Z_i, (15)

where ξ_{t_h} is the intercept at time t_h, β is the vector of regression parameters, and g_2(x) is a link function. Common choices for g_2 include the log-log function log{− log(x)} and the log function log(x). Model (15) becomes the Cox PH model (3) when g_2(x) = log{− log(x)} and ξ_{t_h} = log Λ_0(t_h). We use the following GEE to estimate the unknown parameters ψ = (ξ_H, β), with ξ_H = (ξ_{t_1}, . . . , ξ_{t_H}):

Σ_{i=1}^n (∂μ_{2,i}/∂ψ)^⊤ V_{2,i}^{−1} {Ŝ^i_u(t) − μ_{2,i}} = 0, (16)

where μ_{2,i} = {g_2^{−1}(ξ_{t_1} + β^⊤Z_i), . . . , g_2^{−1}(ξ_{t_H} + β^⊤Z_i)}^⊤, Ŝ^i_u(t) = {Ŝ^i_u(t_1), . . . , Ŝ^i_u(t_H)}^⊤, and V_{2,i} is an H × H working covariance matrix that accounts for the correlation inherent in the pseudo-observations. 12 We denote by ψ̂^PO = (ξ̂^PO_H, β̂^PO) the estimators obtained from solving (16).
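To make the estimating equation (14) concrete under a working-independence, unit-variance choice of V_{1,i}, the following Python sketch solves it by Gauss-Newton iteration for a single covariate. The function `fit_incidence_gee` and the toy data are illustrative only, not the article's R implementation (which uses geepack).

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_incidence_gee(pseudo_pi, x, iters=50):
    """Solve sum_i D_i (pi_i - mu_i) = 0 for (alpha0, alpha1), where
    mu_i = expit(alpha0 + alpha1*x_i) and D_i = mu_i(1-mu_i)(1, x_i)
    (working independence, V_i = 1). Gauss-Newton, one covariate."""
    a0, a1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0              # estimating function components
        j00 = j01 = j11 = 0.0      # Gauss-Newton matrix D^T D
        for pi_i, xi in zip(pseudo_pi, x):
            mu = expit(a0 + a1 * xi)
            w = mu * (1.0 - mu)    # derivative of expit
            r = pi_i - mu
            g0 += w * r
            g1 += w * r * xi
            j00 += w * w
            j01 += w * w * xi
            j11 += w * w * xi * xi
        det = j00 * j11 - j01 * j01
        a0 += (j11 * g0 - j01 * g1) / det   # solve 2x2 system J * step = g
        a1 += (-j01 * g0 + j00 * g1) / det
    return a0, a1

# Responses constructed to satisfy the model exactly at (0.5, -1.0).
pseudo_pi = [expit(0.5), expit(0.5), expit(-0.5), expit(-0.5)]
x = [0.0, 0.0, 1.0, 1.0]
a0, a1 = fit_incidence_gee(pseudo_pi, x)
print(round(a0, 3), round(a1, 3))  # -> 0.5 -1.0
```

In practice the responses are the noisy pseudo-observations π̂^i, the covariate vector is multivariate, and a sandwich or jackknife variance estimator accompanies the point estimates, as described in Section 4.3.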

Estimation of parameters under BCH model
Under the PHPH model (5), the short-term and long-term covariate effects can be estimated separately. To estimate the long-term effect (γ_0, γ), we consider the following GLM

g_3(θ̂^i) = γ_0 + γ^⊤X_i + ε_i, (17)

where ε_i, i = 1, . . . , n, are independent and identically distributed (i.i.d.) with mean zero, and g_3 is a link function. Possible choices of g_3(·) are the log link function log(x) and the log-log function log{− log(x)}. Of these, setting g_3(x) = log(x) corresponds to the exponential parameterization θ(X) = exp(γ_0 + γ^⊤X) in model (5), which motivates us to estimate the parameters (γ_0, γ) based on the pseudo-observations created by θ̂^i_KM or θ̂^i_NP via

Σ_{i=1}^n {∂μ_{3,i}/∂(γ_0, γ^⊤)}^⊤ V_{3,i}^{−1} (θ̂^i − μ_{3,i}) = 0, (18)

where μ_{3,i} = g_3^{−1}(γ_0 + γ^⊤X_i), θ̂^i is the pseudo-observation for θ, and V_{3,i} is a working variance of θ̂^i. Let (γ̂^PO_{0,KM}, γ̂^PO_KM) and (γ̂^PO_{0,NP}, γ̂^PO_NP) be the estimators obtained from (18) when θ̂^i is replaced by (9) and (10), respectively. Under model (5), the short-term effect acts through F̄(t)^{η(Z)}, so that log{− log F̄(t)^{η(Z)}} = ϕ^⊤Z + log{− log F̄(t)}. We thus consider the GLM

g_4(1 − F̂^i(t_h)) = ς_{t_h} + ϕ^⊤Z_i + ε_{i,h}, (19)

where g_4(x) = log{− log(x)}, ς_{t_h} = log{− log F̄(t_h)}, and ε_{i,h}, i = 1, . . . , n, are i.i.d. with mean zero. We consider the following GEE to estimate ϕ and ς_H = (ς_{t_1}, . . . , ς_{t_H}):

Σ_{i=1}^n {∂μ_{4,i}/∂(ς_H, ϕ)}^⊤ V_{4,i}^{−1} {F̂^i(t) − μ_{4,i}} = 0, (20)

where μ_{4,i} = [1 − exp{− exp(ς_{t_1} + ϕ^⊤Z_i)}, . . . , 1 − exp{− exp(ς_{t_H} + ϕ^⊤Z_i)}]^⊤, F̂^i(t) = {F̂^i(t_1), . . . , F̂^i(t_H)}^⊤ with the pseudo-observations created by (11) and (12), respectively, and V_{4,i} is an H × H working covariance matrix.

Variance estimation and model diagnosis
All estimators obtained from solving the estimating equations in Sections 4.1 and 4.2 can be computed using the geese function in the R package geepack. 35 We adopt the approximate jackknife variance estimates, 36 which are available in the geese function. Note that adopting a sandwich estimator might lead to inconsistent and upward-biased variance estimates; however, this has an insignificant impact in practical applications. 20 We follow the idea of pseudo-residuals 33 to assess the goodness of fit of (13) and (15) for the PHMC model. Define the pseudo-residuals {π̂^i − g_1^{−1}(α̂_0 + α̂^⊤X_i), i = 1, . . . , n} for model (13) and {Ŝ^i_u(t) − g_2^{−1}(ξ̂_t + β̂^⊤Z_i), i = 1, . . . , n} calculated at a given time t ∈ t for model (15). If the model fits the data well, no trend should be perceptible when plotting the residuals against a covariate. Similarly, we consider the pseudo-residuals {θ̂^i − g_3^{−1}(γ̂_0 + γ̂^⊤X_i), i = 1, . . . , n} for model (17) and the pseudo-residuals of F̂^i(t), i = 1, . . . , n, at a given time t ∈ t for model (19). The idea of pseudo-residuals is illustrated with the melanoma data in Section 6.1.
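The pseudo-residuals for the incidence model are simply the pseudo-observations minus the fitted inverse-link means. A minimal Python sketch (function name and numbers are illustrative only):

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def incidence_pseudo_residuals(pseudo_pi, X, alpha0, alpha):
    """Pseudo-residuals for the incidence model: pseudo-observation minus the
    fitted inverse-logit mean. Plotted against a covariate, they should show
    no trend if the model fits the data well."""
    return [p - expit(alpha0 + sum(a * xj for a, xj in zip(alpha, x)))
            for p, x in zip(pseudo_pi, X)]

# Two subjects with fitted intercept 0.0 and slope 2.0 (toy values).
res = incidence_pseudo_residuals([0.9, 0.1], [[1.0], [0.0]], 0.0, [2.0])
print([round(r, 4) for r in res])  # second residual is 0.1 - expit(0) = -0.4
```

The residuals for the latency, long-term, and short-term models are formed analogously by replacing the inverse logit with the corresponding inverse link.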

Variable selection
The proposed pseudo-observations approach allows variable selection and parameter estimation to be implemented simultaneously in each component of the PHMC model and the PHPH model by penalizing the corresponding GEEs. 28 Specifically, for the incidence component of the PHMC model, we penalize the GEE in (14) by

U_1(α_0, α) − n q_{λ_1}(|α|) ∘ sgn(α) = 0, (21)

where U_1(α_0, α) denotes the left-hand side of (14), and, for p × 1 vectors u and v, q_{λ_1}(u) = {q_{λ_1}(|u_1|), . . . , q_{λ_1}(|u_p|)}^⊤ is a vector of penalty functions with tuning parameter λ_1, sgn(u) = {sgn(u_1), . . . , sgn(u_p)}^⊤, and u ∘ v is the element-wise product of u and v. The intercept α_0 is not penalized and is left out of the penalty term in (21). Similarly, we consider the following PGEE for ψ in the latency component of the PHMC model by extending (16):

U_2(ψ) − n q_{λ_2}(|β|) ∘ sgn(β) = 0, (22)

where U_2(ψ) denotes the left-hand side of (16), λ_2 is the tuning parameter, and the intercept term ξ_H is left out of the penalty term. For the long-term effect of the PHPH model, we extend (18) to the PGEE

U_3(γ_0, γ) − n q_{λ_3}(|γ|) ∘ sgn(γ) = 0, (23)

where U_3(γ_0, γ) denotes the left-hand side of (18), λ_3 is the tuning parameter, and the intercept γ_0 is not penalized. Lastly, for the short-term effect of the PHPH model, we penalize (20) through

U_4(ς_H, ϕ) − n q_{λ_4}(|ϕ|) ∘ sgn(ϕ) = 0, (24)

where U_4(ς_H, ϕ) denotes the left-hand side of (20), λ_4 is the tuning parameter, and ς_H is not penalized. In the simulation studies and data analyses, we illustrate the proposed procedure with the SCAD penalty 24 and select the tuning parameters via five-fold cross-validation, where the data are randomly partitioned into five subsets of approximately equal size. For a given tuning parameter λ, we calculate the overall cross-validated prediction error

CV(λ) = Σ_{k=1}^5 |N^{(k)}|^{−1} Σ_{j∈N^{(k)}} m_j^{−1} Σ_{ℓ=1}^{m_j} {PR(θ̂^{(−k)⊤} W_{jℓ})}^2,

where θ̂^{(−k)} is the PGEE estimator based on the data without the kth subset, |N^{(k)}| is the size of the kth subset, m_j is the number of pseudo-observations for subject j, and PR(θ̂^{(−k)⊤} W_{jℓ}) are the pseudo-residuals defined in Section 4.3 with covariates W_{jℓ}. We use the estimates obtained from the unpenalized GEEs as initial values in the iteration of the PGEEs and in the cross-validation.
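The penalty functions q_λ(·) enter the PGEEs only through their first derivative. Below is a Python sketch of the SCAD derivative of Fan and Li (2001) with the conventional a = 3.7; the function name is ours and the snippet is an illustration, not the article's implementation.

```python
def scad_deriv(u, lam, a=3.7):
    """First derivative of the SCAD penalty at |u|: equals lam on [0, lam],
    decreases linearly to 0 on (lam, a*lam], and is 0 beyond a*lam, so large
    coefficients are left unshrunk."""
    u = abs(u)
    if u <= lam:
        return lam
    return max(a * lam - u, 0.0) / (a - 1.0)

print(scad_deriv(0.05, 0.1))  # -> 0.1  (small coefficient: LASSO-like shrinkage)
print(scad_deriv(1.00, 0.1))  # -> 0.0  (large coefficient: no shrinkage)
```

In the PGEE iteration, this derivative evaluated at the current coefficient values supplies the elements of q_λ(|·|) in (21)-(24).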

Simulation
Simulation studies are conducted to assess the finite sample performance of the proposed estimators. We first evaluate the performance under the PHMC model. Specifically, we generate the cure status according to the logistic model (2) with one covariate X_i, i = 1, . . . , n, generated from a Bernoulli(0.5) distribution. The regression coefficient is set at α = (α_0, − 1) so that the treatment group (i.e., those with X_i = 1) is more likely to be cured. The intercept is chosen to be α_0 = 2.8, α_0 = 2, or α_0 = 0.9 to achieve average cure rates of 10% (14.1% among those with X_i = 1), 20% (26.7% among those with X_i = 1), or 40% (52.3% among those with X_i = 1), respectively. On the other hand, the survival times are generated from the Cox PH model (3), with λ_0(t) = 1/3 and (Z_i1, Z_i2) generated from independent Bernoulli(0.5) and Uniform(0, 1) distributions, respectively. We set the regression coefficient in (3) at β = (1, 0.5) and generate the censoring times from Uniform(0, c), where c is chosen to make 10% of the censored subjects susceptible. Throughout the simulations, the pseudo-observations are calculated at 10 time points taken from the quantiles of the observed event times between 0 and t_max. Table 1 summarizes the simulation results for the 20% and 40% cure rate scenarios based on (14) and (16) with link functions g_1(x) = log{x/(1 − x)}, g_2(x) = log{− log(x)}, and ξ_{t_h} = log Λ_0(t_h). The scenario with a 10% cure rate is presented in Table 1 in Web Appendix E. We only present the results under a working independence assumption among the pseudo-observations; adopting a more complicated covariance structure provides no obvious improvement, as shown in Klein and Andersen 37 and Graw et al. 19 The proposed estimates are compared with the EM-algorithm estimators of Peng and Dear 6 with B = 100 bootstrap samples for standard error estimation.
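The PHMC data-generating mechanism just described can be sketched as follows. This Python version treats the censoring bound c as a fixed input rather than calibrating it to the 10%-susceptible rule, so it is an approximation of the paper's setup; all function names are ours.

```python
import math
import random

def simulate_phmc(n, alpha0, c=12.0, seed=1):
    """Generate (time, status, x, z1, z2) under the PHMC design: cure status
    from a logistic model with one Bernoulli(0.5) covariate and alpha = (alpha0, -1),
    latency from a Cox PH model with baseline hazard 1/3 and beta = (1, 0.5),
    censoring from Uniform(0, c). c is assumed fixed here, not calibrated."""
    random.seed(seed)
    data = []
    for _ in range(n):
        x = random.randint(0, 1)
        pi = 1.0 / (1.0 + math.exp(-(alpha0 - 1.0 * x)))  # uncured probability
        uncured = random.random() < pi
        z1, z2 = random.randint(0, 1), random.random()
        if uncured:
            # exponential latency: hazard (1/3) * exp(1.0*z1 + 0.5*z2)
            rate = (1.0 / 3.0) * math.exp(1.0 * z1 + 0.5 * z2)
            t_star = random.expovariate(rate)
        else:
            t_star = math.inf  # cured subjects never fail
        cens = random.uniform(0.0, c)
        data.append((min(t_star, cens), 1 if t_star <= cens else 0, x, z1, z2))
    return data

sample = simulate_phmc(500, alpha0=2.0)
print(sum(1 - d[1] for d in sample) / len(sample))  # observed censoring proportion
```

Cured subjects are always observed as censored, while uncured subjects are censored only if their Uniform(0, c) censoring time precedes the event.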
For each estimator, we report the average bias (Bias), the empirical standard error (ESE), the average of the standard error estimator (SEE), and the empirical coverage rate (CR) of the 95% confidence interval based on 500 replicates with sample sizes n = 200, 400, 600, and 1000. Overall, the proposed estimators perform reasonably in all scenarios, with Bias, ESE, and SEE all decreasing as n increases, while the CRs are close to the nominal level of 0.95. The estimators (α̂^PO_{0,NP}, α̂^PO_NP) and (α̂^PO_{0,KM}, α̂^PO_KM) have similar performance, indicating that pseudo-observations constructed by (6) or (7) are both appropriate. Compared to the estimates of Peng and Dear, 6 our estimators have higher ESE and SEE under the various sample sizes, indicating a loss of efficiency. Specifically, based on our simulation settings, the smallest relative efficiency loss for the latency component is around 12% with n = 1000 under the 10% cure rate scenario, and the smallest relative efficiency loss for the incidence component is around 9% with n = 1000 under the 40% cure rate scenario. This efficiency loss is likely due to the discrete approximation of the baseline hazard function on a continuous time scale, for example, the 10 selected time points for the latency component in our approach. Note that the latency component yields smaller ESE and SEE under the 10% cure rate than under the 20% and 40% cure rate scenarios, as more noncured subjects contribute to the estimation of β.
On the other hand, Table 2 in Web Appendix E reports the average computing time in seconds for each estimator under the 10% cure rate scenario based on 500 replicates. For each fixed n, our proposed estimators, including variance estimation, have smaller total computing times (latency plus incidence) than the EM-algorithm estimate 6 with B = 100 bootstrap samples for standard error estimation. Computing times increase with the sample size. Among our proposed estimators for the incidence component, the KM-based estimator (6) is faster to compute than the estimator (7). All computing times are obtained in R on a Linux machine with an Intel Core i7-8565U processor and 15.4 GB of memory. We next evaluate the performance of the proposed estimators under the PHPH model (5). For subject i, two independent covariates X_i1 and X_i2 are generated from Bernoulli(0.5) and the standard normal distribution, respectively. We set θ(X_i) = exp(γ_0 + γ_1 X_i1), η(Z_i) = exp(ϕ_1 X_i1 + ϕ_2 X_i2), and F̄(x) = exp(−2x). It is not straightforward to simulate survival times from the improper survival function of model (5) because it has a positive mass at ∞. We thus utilize the connection between the MC model (1) and the PHPH model (5) to generate survival times. The data generation algorithm is summarized in Web Appendix D. We set γ_1 = −0.1 and (ϕ_1, ϕ_2) = (0.4, − 0.3) with two different values of γ_0, 0.5 and −0.05, which correspond to 20% and 40% cure rates, respectively. The censoring times are independently generated from Uniform(0, c), where c is chosen so that 10% of the censored subjects are susceptible. Tables 3 and 4 summarize the simulation results for the long-term and short-term effects based on (18) and (20), respectively. A maximum likelihood estimator (MLE) is implemented for comparison. The proposed estimates are virtually unbiased, the SEE is reasonably close to the ESE, and the CR is close to the nominal level.
The two constructions of the pseudo-observations yield a similar pattern, indicating that both approaches are feasible. The proposed estimators have higher ESE and SEE than the MLE under different sample sizes, indicating a loss of efficiency. For example, the relative efficiency losses are around 19% and 25% for the long-term and short-term effects, respectively, under the scenario with n = 1000, a 40% cure rate, and a 50% censoring rate.
To study the performance of variable selection under the PHMC model, we consider the covariate vector X = (X_1, . . . , X_20), in which X_1 and X_2 are independently generated from Uniform(0, 1), X_3 and X_4 are independently generated from Bernoulli(0.5), and X_5, . . . , X_20 are generated from a multivariate normal distribution with E(X_i) = 0 and Cov(X_i, X_j) = 0.5^{|i−j|} for i, j = 5, . . . , 20. We set the regression coefficients at α_0 = 1.1, α = (0, 1, − 1.2, 0, 0, − 0.9, 0.8, 0, 0, . . . , 0)^⊤, and β = (−0.7, 0, 1, 0, − 0.5, 0.8, 0, 0, 0, . . . , 0)^⊤, and the baseline hazard function at λ_0(t) = 1/3. These configurations yield a cure rate of 30%. The censoring times are independently generated from Uniform(0, c), where c is chosen so that either 10% or 30% of the censored subjects are susceptible. For each simulated data set, we fit the GEE full model that includes all covariates, the GEE oracle model that includes only the covariates with nonzero coefficients, the proposed PGEE model with the SCAD penalty, and the PHMC model with the LASSO and ALASSO penalties proposed by Masud et al. 27 Table 2 reports the mean squared error (MSE), true positives (TP), and false positives (FP), where α̂_j and β̂_j denote the estimators of α and β based on the jth generated data set. The TP and FP are calculated as the average numbers of selected covariates with actual nonzero and zero coefficients, respectively. Five-fold cross-validation is used to determine the tuning parameter. For the incidence component, the two pseudo-observations approaches yield similar performance, making them practically interchangeable. For all scenarios, the MSE of the proposed PGEE approaches is smaller than that of the full model but larger than that of the oracle model; however, it approaches the MSE of the oracle model as n increases. In addition, when the censoring rate is 40%, the proposed PGEE approaches behave more like the oracle model, with TP close to the number of nonzero covariates and FP decreasing as n increases.
In contrast, the FP of the LASSO and ALASSO estimators does not seem to decrease with increasing n. For the latency component, we consider the proposed PGEE with three different correlation structures: independence, exchangeable, and AR(1). The PGEE with the independence structure tends to have slightly higher TP and FP and lower MSE than the PGEE with the exchangeable and AR(1) correlation structures. However, the performances are close when the sample size is large and the censoring rate is 40%, indicating that little is gained by specifying a complicated correlation structure among the pseudo-observations for variable selection. When the censoring rate is 60%, the TP decreases and the MSE increases for all methods presented. Compared to the results based on LASSO and ALASSO, our proposed PGEE performs reasonably in identifying important variables with n = 600. The results with n = 400 and 1000 reveal a similar trend and are presented in Table 5 in Web Appendix E.
To investigate variable selection performance under the PHPH model, we generate the covariate vector following the PHMC model's configuration to specify the short-term and long-term effects. In this case, we set γ_0 = 0.85, γ = (0, 0, − 0.9, 0, 0, − 0.7, 0, 1, 0, . . . , 0), ϕ = (− 0.5, 0, 0.8, 0, − 0.7, 0, 0, 0, 0, . . . , 0), and the baseline function F̄(x) = exp(−2x), resulting in a cure rate of 30%. The censoring times are independently generated from Uniform(0, c), where c is chosen so that either 10% or 30% of the censored subjects are susceptible. Tables 6 and 7 in Web Appendix E present the simulation results for the long-term and short-term effects based on different sample sizes. The MSE for the long-term and short-term effects is defined analogously to that of the incidence and latency components described above. We observe that the two constructions of pseudo-observations yield similar results. For each sample size n, the MSE of the PGEE is smaller than that of the full model but larger than that of the oracle model. As n increases, the MSE of the PGEE decreases toward that of the oracle model; the TP also increases and the FP decreases. However, as expected, the MSE increases and the TP decreases when the censoring rate increases to 60%. Finally, based on our simulation results, we observe minimal performance gains when specifying complicated correlations among the pseudo-observations. With the same simulation specifications, we observe an improvement in FP and a loss in TP (results not shown here) when a one-standard-error rule is used to select the optimal tuning parameter.
6 Data analysis

6.1 The melanoma data
We apply the proposed method to a melanoma dataset from the Eastern Cooperative Oncology Group phase III clinical trial e1684, 1 which is available from the R package smcure. 38 The primary objective is to determine whether the high-dose interferon alpha-2b (IFN) regimen in postoperative adjuvant therapy leads to a significantly prolonged relapse-free interval for melanoma. The event of interest is the relapse of melanoma. The covariates of interest include treatment (0=placebo, 1=IFN), gender (0=male, 1=female), and age (centered to zero). After excluding missing data, a total of 284 subjects are included in the analysis. The overall censoring rate is 30.9%. Figure 1(a) shows the KM estimates stratified by treatment and gender. The KM estimates level off at the end of the study, suggesting a fraction of subjects nonsusceptible to the recurrence of melanoma. This observation is confirmed by the Maller-Zhou test 30 with a p-value of <0.001. Based on the KM curves, males have a higher cure rate than females in the treatment group, whereas the pattern is reversed in the control group, implying an interaction between treatment and gender. Thus, the interaction term is also considered in the data analysis.
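The leveling-off of the KM curve is easy to inspect directly: the KM estimate at the last observed event time gives a crude estimate of the cure fraction. The following minimal sketch computes the KM estimator from scratch on toy data (not the melanoma data) with a heavy late-censoring plateau; the formal assessment in the text is the Maller-Zhou test.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimator; returns the distinct event times and S(t)
    evaluated just after each of them."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    ts = np.unique(time[event == 1])
    s, surv = 1.0, []
    for t in ts:
        at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        s *= 1.0 - d / at_risk
        surv.append(s)
    return ts, np.array(surv)

# Toy data: many late censorings produce a plateau suggestive of a cure fraction.
time = [1, 2, 2, 3, 5, 8, 8, 8, 8, 8]
event = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
ts, surv = kaplan_meier(time, event)
cure_rate_km = surv[-1]   # S_KM at the last event time: a crude cure-rate estimate
print(ts, surv, cure_rate_km)
```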
The top panel of Table 3 presents the results from the PHMC model obtained by (14) and (16) with g_1(x) = log{x/(1 − x)}, g_2(x) = log{−log(x)}, and ξ_{t_h} = log Λ_0(t_h). The lower panel of Table 3 presents the results from the PHPH model obtained by (18) and (20) with g_3(x) = log(x), g_4(x) = log{−log(x)}, and ς_{t_h} = log{−log(F(t_h))}. For comparison, we include the estimator 6 based on the EM algorithm with standard errors obtained from 500 bootstrapped samples. The two constructions π̂_i^KM and π̂_i^NP of the pseudo-observations for π yield similar patterns. For the incidence component, our proposed estimates α̂_PO^NP and α̂_PO^KM are similar to the estimate α̂_EM. The treatment has a significantly adverse effect on susceptibility (being noncured), which means the treatment substantially improves the relapse-free (cured) rate, especially for males. The positive age effect indicates that older patients tend to have a higher relapse rate of melanoma. In the latency component, none of the four covariates is significantly associated with the failure time among susceptible patients. However, female patients tend to have a lower risk of melanoma recurrence than male patients in the treatment group but a higher risk in the control group. Under the PHPH model, the treatment has a significantly negative long-term effect, indicating that the high-dose IFN regimen increases the melanoma relapse-free rate, especially for males. Older patients have a higher probability of melanoma recurrence, although the effect is not statistically significant at the 5% nominal level. These results are consistent with the findings for the incidence component of the PHMC model. None of the four covariates is significantly related to the short-term effect at the 5% nominal level. The estimates indicate that females are likely to have a more rapidly developing melanoma within the treatment group.
However, the result is reversed in the control group; that is, males are expected to experience melanoma recurrence sooner than females. These findings are corroborated by Figure 1(a), where the KM estimate drops relatively faster for females than for males in the treatment group, whereas the KM estimates drop faster for males than for females in the control group.
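The incidence estimating equations with the logit link g_1 above can be solved without specialized software: under an independence working correlation with a binomial working variance, the canonical logit link reduces the GEE score to Σ_i x_i(π̂_i − expit(x_i^⊤α)) = 0, which stays well defined even when pseudo-observations fall outside [0, 1]. The sketch below solves this score by Newton-Raphson on simulated pseudo-observation-like responses (hypothetical data, not the melanoma trial).

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit_gee(X, y, tol=1e-10, max_iter=100):
    """Solve sum_i x_i (y_i - expit(x_i'a)) = 0 by Newton-Raphson.

    With the canonical logit link, binomial working variance, and an
    independence working correlation, the GEE score takes exactly this
    form and remains valid for continuous responses outside [0, 1]."""
    a = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = expit(X @ a)
        score = X.T @ (y - mu)
        W = mu * (1.0 - mu)                  # derivative of expit
        H = X.T @ (X * W[:, None])           # negative Jacobian of the score
        step = np.linalg.solve(H, score)
        a = a + step
        if np.max(np.abs(step)) < tol:
            break
    return a

# Illustrative responses: logit mean plus noise, so some values exceed [0, 1].
rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
a_true = np.array([0.5, -1.0])
y = expit(X @ a_true) + rng.normal(0, 0.3, n)
a_hat = fit_logit_gee(X, y)
print(a_hat)
```

A sandwich variance estimator would then supply the standard errors, as in a standard GEE analysis.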
To understand the covariate effects on cure rates in the two models, we calculate the estimated cure rates in each group while holding the age variable at its average. The results presented in Table 6 of Web Appendix E suggest higher cure rates in the treatment groups, echoing our finding that the treatment substantially improves the relapse-free (cured) rate. The estimated cure rates based on α̂_PO^NP and α̂_PO^KM under the PHMC model are close to that based on Ŝ_KM(t_max); however, the results based on γ̂_PO^NP and γ̂_PO^KM under the PHPH model tend to give higher cure rates, except for females in the treatment group. This suggests that the PHMC model is more conservative in estimating cure rates. To assess goodness of fit, Figure 2 in Web Appendix F presents boxplots of pseudo-residuals for models (13) and (15) under the PHMC model; the pseudo-residuals for models (17) and (19) under the PHPH model are presented in Figure 3 of Web Appendix F. The top panel shows the residuals calculated based on the pseudo-observations θ̂_i^KM and the resulting estimators, and the bottom panel presents the pseudo-residuals calculated based on the pseudo-observations F̂_i^KM(t) at four given time points. Compared to Figure 2 in Web Appendix F, the pseudo-residuals under the PHPH model tend to have larger variation. This result might suggest that the PHMC model fits the data better than the PHPH model.

6.2 The dental data
We apply the proposed PGEE approaches to a dental dataset from the Creighton University School of Dentistry. 2 The dataset contains dental records from 5336 patients with periodontal disease collected between August 2007 and March 2013. In this work, the outcome of interest is the time to the first tooth loss due to periodontal reasons for each patient, yielding a censoring rate of 74.1%. The data analysis includes a total of 44 risk factors, whose detailed descriptions can be found in Tables 3 and 4 of Calhoun et al. 2 The length of follow-up was 5.7 years, and the last event occurred at 5.37 years in both the molar and non-molar groups. There were 35 and 20 teeth censored between the last event and the end of the study in the molar and non-molar groups, respectively. Figure 1(b) shows the KM survival curves stratified by tooth type (3456 molars vs. 1880 non-molars) leveling off at nonzero probabilities, indicating the possible presence of a cure fraction in the population. This is also confirmed by the Maller-Zhou test 30 with a p-value of <0.001.
As the study aims to identify which factors are associated with tooth loss, the proposed PGEE provides a logical tool for risk factor screening. We perform the variable selection procedure under both the PHMC model and the PHPH model. Under the PHMC model, we apply the proposed PGEEs in (21) and (22) with pseudo-observations created by π̂_i^KM and Ŝ_i^u(t), respectively, while under the PHPH model, we apply the proposed PGEEs in (23) and (24) with pseudo-observations created by θ̂_i^KM and F̂_i^KM(t), respectively. In both models, a working independence correlation is adopted for the ten pseudo-observations obtained from the quantiles of the observed event times. The tuning parameters are selected via five-fold cross-validation. For comparison, we also fit the PHMC model with the LASSO and ALASSO penalties proposed by Masud et al., 27 as implemented in the R package intsurv. 39 In the variable selection procedure, the molar variable's coefficient is not penalized since its effect is of primary research interest. Tables 4 and 5 present the variable selection results. Under the PHMC assumption, when predicting whether a tooth is cured, the LASSO and ALASSO select more variables, as we observed in the simulation studies; see Table 2 and Table 5 in Web Appendix E. The commonly selected factors across the presented methods include mobility score (mobil), bleeding on probing (bleed), periodontal probing depth (pocket_mean), decayed surfaces new (decayed_new), decayed surfaces recurrent (decayed_recurrent), endodontic therapy (endo), filled tooth (filled_tooth), and recurrent decayed surfaces (decay_recur_sum). Only filled tooth increases the chance of cure; the others decrease it. In addition, for the latency, the PGEE tends to identify more risk factors in predicting tooth survival time.
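The construction of the ten pseudo-observations per subject used above can be sketched as follows: jackknife the KM estimator at a grid of quantiles of the observed event times, n·Ŝ(t) − (n−1)·Ŝ^{(−i)}(t), and stack the results in long format for a working-independence GEE fit. The data below are simulated placeholders, not the dental records, and the brute-force leave-one-out loop is for clarity rather than speed.

```python
import numpy as np

def km_surv(time, event, t):
    """Kaplan-Meier survival probability at time t."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    s = 1.0
    for u in np.unique(time[(event == 1) & (time <= t)]):
        at_risk = np.sum(time >= u)
        d = np.sum((time == u) & (event == 1))
        s *= 1.0 - d / at_risk
    return s

def pseudo_obs(time, event, grid):
    """Jackknife pseudo-observations n*S(t) - (n-1)*S^{(-i)}(t) on a time grid."""
    n = len(time)
    full = np.array([km_surv(time, event, t) for t in grid])
    po = np.empty((n, len(grid)))
    for i in range(n):
        keep = np.arange(n) != i
        loo = np.array([km_surv(time[keep], event[keep], t) for t in grid])
        po[i] = n * full - (n - 1) * loo
    return po

rng = np.random.default_rng(3)
n = 100
T = rng.exponential(1.0, n)
C = rng.exponential(2.0, n)
obs = np.minimum(T, C)
event = (T <= C).astype(int)

# Ten grid points at quantiles of the observed event times, as in the analysis.
grid = np.quantile(obs[event == 1], np.linspace(0.1, 0.9, 10))
po = pseudo_obs(obs, event, grid)          # n x 10 matrix, one row per subject
# Long format (subject id, time point, pseudo-observation), ready for a GEE fit.
long = np.column_stack([np.repeat(np.arange(n), 10),
                        np.tile(grid, n),
                        po.ravel()])
print(long.shape)
```

Each row of `po` plays the role of a subject's cluster of correlated responses in the (P)GEE.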
The mobility score (mobil), decayed surfaces new (decayed_new), and filled tooth (filled_tooth) are the three common factors related to tooth survival time selected by both the PGEE and the LASSO. Under the PHPH model assumption, our proposed PGEE approach suggests that plaque score (plaque), bleeding on probing (bleed_ave), plaque index (plaque_ave), filled surfaces (filled_sum), recurrent decayed surfaces (decay_recur_sum), decayed and filled surfaces (dfs_sum), number of decayed teeth (decayed_tooth_sum), and age at baseline (baseline_age) are importantly associated with the long-term effect. Moreover, 20 factors are identified in association with the short-term effect, indicating more rapid tooth loss. Interestingly, 15 variables are selected by the PGEE both in the latency component of the PHMC model and for the short-term effect of the PHPH model, even though the covariates are interpreted differently in the two models.

Conclusions
This article extends the pseudo-observations approach to right-censored survival data with a cure fraction for two popular cure models, the MC model and the BCH model. Several estimators for the regression parameters related to the cure rate and the risk of experiencing the event are proposed under the PHMC model and the PHPH model. The proposed methods allow researchers to estimate the covariate effects separately and to identify important factors associated with the cure rate and the risk of failure. Simulation studies show that the proposed methods perform reasonably well in finite samples. We also demonstrate the proposed methodology on two real applications involving survival data with a cure fraction.
In this work, for the MC model, we propose two different incidence regression estimators α̂_PO^KM and α̂_PO^NP. For the BCH model, we consider two different regression estimators γ̂_PO^KM and γ̂_PO^NP for the long-term effects and two different regression estimators φ̂_PO^KM and φ̂_PO^NP for the short-term effects. In the simulation studies, each pair of estimators performs similarly. However, from a computational point of view, we recommend that researchers consider the estimators based on the KM estimator because their computing time is short. On the other hand, as shown in the simulation studies, our proposed estimators based on pseudo-observations under both the MC and BCH models have larger ESE and SEE than those from the usual MC and BCH models. This efficiency loss is commonly seen in the pseudo-observations literature. 12,40 As pointed out by a referee, the efficiency loss is a limitation of our proposed pseudo-observations approach for cure models when the number of covariates is small, as in our real data analysis in Section 6.1. However, modeling pseudo-observations constitutes a general and straightforward approach that simplifies survival analysis. In our applications, the pseudo-observations approach brings several advantages. First, it is more flexible and feasible for parameter estimation and variable selection when the number of covariates is large. The estimating procedure can be performed separately for each component in the MC model and for each effect in the BCH model. Second, the computing time of the pseudo-observations approach is shorter than that of standard approaches that use the EM algorithm for estimation and bootstrapping for standard errors. Finally, pseudo-residuals can be used for model diagnosis via residual plots.
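On the last point, a pseudo-residual is simply a pseudo-observation minus the model-based fitted mean, summarized (or boxplotted) per time point. The sketch below is a hypothetical stand-in: the logit-link mean, the noise level, and the simulated pseudo-observations are all assumptions for illustration, not output of the fitted models in the paper.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(11)
n, H = 300, 4
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([0.4, -0.8])

# Hypothetical pseudo-observations at H time points: true mean plus noise,
# standing in for jackknifed KM survival probabilities.
mu = expit(X @ beta)
po = mu[:, None] + rng.normal(0, 0.25, (n, H))

# Pseudo-residuals: pseudo-observation minus fitted mean, summarized per
# time point (these summaries are what residual boxplots display).
resid = po - mu[:, None]
per_time_mean = resid.mean(axis=0)
per_time_iqr = (np.percentile(resid, 75, axis=0)
                - np.percentile(resid, 25, axis=0))
print(per_time_mean, per_time_iqr)
```

Under a well-fitting model the per-time-point residual distributions should be centered near zero with comparable spread, which is what the boxplot diagnostic checks informally.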
Our PHPH model links the short-term effect to the baseline survival function via a proportional hazards model. The proposed PHPH model can be extended to accommodate non-proportional hazards by considering an exponential transformation or an accelerated failure time model for the short-term effect. 8 On the other hand, we applied the pseudo-observations approach to the PHMC model. Another commonly used MC model is the accelerated failure time MC (AFTMC) model, 7 in which a logistic regression model is used for the cure status in the incidence component, and the AFT model, log(T) = X^⊤β + ε, is used for the conditional survival function in the latency component, where ε is an error term with survival distribution S_ε(·). Under the AFTMC model, our proposed pseudo-observations (6) and (7) with the GEE approach can be directly applied to estimate the unknown parameters in the incidence component. For the latency component, we discuss the feasibility of using the pseudo-observations approach to estimate β under the AFT model in two cases. First, when the survival distribution S_ε(·) is known, that is, under a parametric AFT model, one can write S_ε^{-1}(S_u(t)) = log(t) − X^⊤β and treat it as a GLM with link function S_ε^{-1}(·). To estimate β, one can create the pseudo-observations for S_u(t) at given time points t ∈ {t_1, . . . , t_H} based on the KM estimator as we proposed in equation (8), and then the GEE approach can be applied to obtain the estimates of β. However, one might need to program the estimating equations oneself because the geese function in the R package geepack 35 provides only commonly used link functions. Second, when the survival distribution S_ε(·) is unknown, as in a semi-parametric AFT model, it may not be straightforward to use the pseudo-observations approach to estimate β even though the pseudo-observations for S_u(t) can be created.
The main difficulty is that the survival function S_ε(·), and hence its inverse S_ε^{-1}(·), is unknown. Further investigation is required, and this will be an interesting topic for future research.
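To make the parametric case concrete, the sketch below assumes a log-normal AFT model (ε normal with scale σ), so the mean model is S_u(t | X) = 1 − Φ((log t − X^⊤β)/σ). With an identity working variance and an independence working correlation, the GEE reduces to nonlinear least squares on the stacked pseudo-observations, solved here by Gauss-Newton with a numerical Jacobian. The "pseudo-observations" are noisy stand-ins for the jackknifed KM quantities described in the text, and the design, grid, and noise level are assumptions for illustration.

```python
import math
import numpy as np

# Standard normal c.d.f., vectorized via math.erf.
Phi = lambda z: 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

rng = np.random.default_rng(5)
n, H = 400, 5
X = rng.standard_normal((n, 2))
beta_true, sigma_true = np.array([0.7, -0.5]), 0.6
grid = np.exp(np.linspace(-1.0, 1.5, H))   # H fixed time points t_1, ..., t_H

def mean_surv(p):
    """Model mean S_u(t_h | X_i) = 1 - Phi((log t_h - X_i'beta)/sigma)."""
    b, s = p[:2], math.exp(p[2])            # log-parametrized sigma stays positive
    return 1.0 - Phi((np.log(grid)[None, :] - (X @ b)[:, None]) / s)

# Stand-in pseudo-observations: true conditional survival plus noise.
po = mean_surv(np.array([0.7, -0.5, math.log(0.6)])) + rng.normal(0, 0.1, (n, H))

# Gauss-Newton on the stacked residuals (the working-independence GEE fit).
p = np.array([0.0, 0.0, 0.0])
for _ in range(50):
    r = (po - mean_surv(p)).ravel()
    J = np.empty((r.size, p.size))
    for j in range(p.size):
        d = np.zeros_like(p); d[j] = 1e-6   # forward-difference Jacobian column
        J[:, j] = ((po - mean_surv(p + d)).ravel() - r) / 1e-6
    step = np.linalg.lstsq(J, -r, rcond=None)[0]
    p = p + step
    if np.max(np.abs(step)) < 1e-9:
        break
beta_hat, sigma_hat = p[:2], math.exp(p[2])
print(beta_hat, sigma_hat)
```

In R, the same fit would require passing the custom link S_ε^{-1}(·) to hand-coded estimating equations, as noted above for geepack.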
We note some limitations of our proposed work for variable selection. Based on our simulation studies, the TP rate appears low for the incidence component of the MC model when n = 200. This result may arise because only one time point is used to create the pseudo-observations for the cure rate. When applying our proposed approach to small sample sizes, such as n = 200 in our simulation, researchers can investigate the stability of variable selection via the bootstrap approach proposed by Royston and Sauerbrei. 41 In this work, we only report the selected coefficient estimates from the penalized approach for the dental data analysis. In practice, one might adopt a two-stage approach in which inference is based on the selected model to obtain standard errors. 27 The validity of inference after penalization and the goodness-of-fit assessments will be studied in future work.