FormalPara Key Points for Decision Makers

Standard parametric survival models may not be suitable for survival extrapolation in cases of complex hazard functions, such as those seen in patients treated with CAR T-cell therapies. Several alternative survival extrapolation methods have been developed, but current guidance on which alternative method to use under which circumstances is limited.

Our analyses demonstrated that mixture cure and cubic spline models generate more accurate survival predictions than standard parametric and mixture models for CAR T-cell therapies in large B-cell lymphoma, even with limited follow-up data.

Goodness-of-fit criteria demonstrated a low correlation with predictive accuracy, suggesting that limited weight should be assigned to these criteria when selecting the optimal survival model.

1 Introduction

Clinical trials assessing the efficacy of new oncology treatments often have limited follow-up at the time of submission to health technology assessment (HTA) agencies. An understanding of patient outcomes beyond the maximum follow-up can have salient implications for future treatment options, as well as a profound impact on cost of care. Survival modeling can be a useful tool in extrapolating future outcomes based on available data, and has been used to inform cost-effectiveness analyses [1]. To determine the full clinical and economic value of a new treatment, extrapolation of observed survival is routinely required, and different extrapolation methods may lead to substantive variation in total survival gain and consequent value estimation. To effectively compare the cost effectiveness of available treatment options, it is critical to determine the predictive accuracy of the statistical models used to measure and compare survival. The most common extrapolation approach is to use standard parametric survival models (e.g., exponential, Weibull, gamma) [2]; however, these models might be unsuitable to capture survival after chimeric antigen receptor (CAR) T-cell therapy, given the potential for durable remission (e.g., 5-year survival) [3,4,5]. Durable remission has been observed in patients with relapsed or refractory (R/R) large B-cell lymphoma (LBCL), after treatment with salvage chemotherapy followed by high-dose chemotherapy and autologous stem cell transplantation [6] or CAR T-cell therapies [7,8,9]. Durable remission can be observed in the Kaplan–Meier (KM) curve, where a plateau indicates that the hazard converges towards the hazard in the general population [10].

Three autologous CAR T-cell therapies have received regulatory approval for R/R LBCL: axicabtagene ciloleucel (axi-cel; 2-year survival rate of 51%) [11], tisagenlecleucel (2-year survival probability of 40%) [12], and lisocabtagene maraleucel (liso-cel; 2-year survival probability of 51%) [13]. To capture the plateau observed in the KM curve for overall survival (OS), a hybrid survival model using a 1-knot spline model with a cure point was used in HTA submissions of tisagenlecleucel, whereas a mixture cure model was used in HTA submissions of axi-cel [14, 15]. Although the National Institute for Health and Care Excellence technical support document (NICE TSD) 21 provides a clear description of these flexible methods [16], it does not provide sufficient guidance on which method to use under what circumstances [17]. Goodness-of-fit (GOF) criteria are frequently used for model selection [2]. These criteria assess how well a model captures the observed survival, but a good statistical fit to the data does not necessarily yield an accurate, or even plausible, estimate of long-term survival, especially if data are immature [18]. This problem is exacerbated when modeling more complex hazard functions, as seen in R/R LBCL [19]. Therefore, our study aimed to retrospectively compare the predictive accuracy of different survival extrapolation methods (i.e., standard parametric distributions, mixture cure, cubic spline, and mixture models) and evaluate the validity of GOF criteria-based model selection using data from various follow-up time points from CAR T-cell therapies in R/R LBCL.

2 Methods

The JULIET (tisagenlecleucel, NCT02445248), ZUMA-1 (axi-cel, NCT02348216), and TRANSCEND NHL 001 (TRANSCEND; liso-cel, NCT02631044) trials evaluated the efficacy of CAR T-cell therapies in patients with R/R LBCL who had received two or more prior lines of systemic treatment. Various survival extrapolation approaches were applied to the three least mature database locks (DBLs) of JULIET and ZUMA-1 and the two least mature DBLs of TRANSCEND (Table 1), in which GOF criteria were evaluated to represent the anticipated data available at the time of HTA filing. Subsequently, predictive accuracy was assessed by comparing survival predictions based on the least mature DBLs to observed survival from the most mature DBLs (internal model validation) and by comparing the extrapolated survival curves to external data (external model validation). Lastly, mean lifetime survival (evaluated as survival until 100 years of age) was quantified for all models.

Table 1 Studies and DBLs included in the study

2.1 Data

Details of the JULIET, ZUMA-1, and TRANSCEND trials have been described elsewhere [7,8,9]. An overview of these trials and the included DBLs are provided in Table 1. For JULIET [8, 12, 19, 20] and ZUMA-1 [9, 11, 21, 22], DBLs were based on publicly available data. KM curves from the different DBLs were digitized using WebPlotDigitizer [23]. The widely accepted and validated algorithm developed by Guyot et al. [24] was used to reconstruct patient-level data. It is an iterative numerical method solving the inverted KM equations, using the digitized KM curve coordinates, the numbers at risk, and total number of events. Potential reliability issues of the algorithm, related to low resolution of KM curves and a lack of reported numbers of events and numbers at risk, did not apply to either trial. The associated KM curves were visually compared with the original KM curves. Furthermore, summary statistics such as survival probabilities and median survival were compared with the original KM data to ensure accuracy of this process. For TRANSCEND, the DBLs used in this study matched the DBLs used (internally) by Bristol Myers Squibb, and patient-level data of these DBLs were available.

2.2 Survival Extrapolation Methods

Four different extrapolation methods were used: standard parametric distributions, mixture cure models, cubic spline models, and mixture models. Standard parametric distributions were originally introduced for extrapolating survival data. Cubic spline models were introduced to overcome the inability of standard parametric models to adequately capture more complicated hazard functions seen in applied studies [15]. Mixture models allow for two or more latent clusters of patients with different survival profiles. Cure models are used in a situation where some individuals will never experience cause-specific death (i.e., their disease-related hazard rate will be zero and the OS curve plateaus at a non-zero value). The evaluated standard parametric distributions were exponential, Gompertz, Weibull, log-normal, log-logistic, gamma, and generalized gamma. Cubic spline models can accommodate more complex hazard patterns, with two or more turning points that can capture multiple important changes in the slope. Evaluated splines had one or two intermediate knots, equally distributed over uncensored log times, using a hazard, odds, or probit model for the transformed survival function [25]. The mixture cure and the mixture models used each represent the population as a mixture of two subpopulations. In the mixture cure models, those cured of the disease were assumed to have survival equal to that of the general population, and survival for those not cured was modeled using one of the standard parametric distributions, resulting in seven different mixture cure models [16]. In the mixture models, survival of both subpopulations was modeled using one of the standard parametric distributions, resulting in a total of 28 different mixture models. Mixture cure, cubic spline, and mixture models were included as per the NICE TSD 21 [26] and the OPTIOM guideline [17]. Standard parametric models were included for reference, as they are the most commonly used extrapolation method [2]. Piecewise models and response-based landmark models were also considered, but these were excluded because of the lack of a clinical basis for selecting a cut-off point and incomplete response-stratified survival data, respectively. Analyses were conducted in R [27], using the flexsurv and muhaz packages [26, 28]. All nonconverging models and models with a nonpositive Hessian matrix were excluded. For the application of background mortality, a relative survival framework was used for all models in line with NICE TSD 21 [16], based on United States 2018 lifetables [29] adjusted to the corresponding trial for age and sex.

2.3 Internal Model Validation

GOF criteria (Akaike information criterion [AIC] [30] and Bayesian information criterion [BIC] [31]) were assessed for all models. Subsequently, the most mature DBLs of the three trials were used for internal validation of the survival extrapolations of all fitted models, using the mean absolute error (MAE) and root mean squared error (RMSE). MAE and RMSE were evaluated per observed event from the date of CAR T-cell infusion (t = 0) to the last event of the most mature DBL.

2.4 External Model Validation

External validation of the survival extrapolations was performed by comparing the extrapolated survival curves to external data from SCHOLAR-1, a retrospective study evaluating outcomes in patients with refractory diffuse LBCL treated with salvage chemotherapy [6]. It was assumed that survival and cure rates with CAR T-cell therapy would be higher than with salvage chemotherapy. Therefore, two validation checks were performed based on SCHOLAR-1 data. First, models with a 5-year survival estimate lower than that observed in SCHOLAR-1 (5-year survival estimate of 16%), were deemed implausible. Second, mixture cure models with an estimated cure fraction <16% were also deemed implausible. In addition to the validation using the SCHOLAR-1 data, the extrapolations were validated by checking whether the disease-specific hazard converged to 0 after 5 years.

3 Results

The most mature survival data and the survival data used to fit survival models are visualized in electronic supplementary material (ESM) Figs. S1 and S2 and Fig. 1 for JULIET, ZUMA-1, and TRANSCEND, respectively. The importance of selecting an appropriate survival model was demonstrated by the large variation seen in the 5-year survival estimates of the different extrapolation models, ranging from 0 to 45%, 10 to 55%, and 0 to 47% over all DBLs of JULIET, ZUMA-1, and TRANSCEND, respectively (Table 2 and ESM Tables S1‒S7). Survival and hazard plots of all models fitted to the different DBLs of the different trials are reported in the Appendix (ESM Figs. S3‒S34). All models converged, except for 36% of the mixture models. The nonconvergence of some mixture models was not limited to specific distributions, trials, or DBLs, although nonconvergence was the highest in the earliest DBL (39%).

Fig. 1
figure 1

Kaplan–Meier curves of the different DBLs of TRANSCEND NHL 001. Shading indicates 95% confidence intervals, and the table below the Kaplan–Meier curves indicates the number of patients at risk for each time point. DBL database lock

Table 2 Characteristics of models fitted to the first database lock of TRANSCEND NHL 001

3.1 Performance of Standard Parametric Models

Standard parametric models had poor predictive accuracy, as measured using the MAE. Among these, the generalized gamma distribution performed best and the exponential distribution performed worst in all trials and most DBLs. Notably, the generalized gamma distribution is the most flexible model due to its three parameters and resulting ability to capture turning points in the hazard in contrast to the exponential distribution, which only has one parameter and assumes a constant hazard. The MAE results from the JULIET trial showed poor performance of the standard parametric models compared with the other extrapolation methods (Fig. 2). MAE results from the ZUMA-1 and TRANSCEND trials are shown in ESM Figs. S35 and S36, respectively. With increasing data maturity, the improvement in performance of the standard parametric models was relatively minor compared with the more flexible models. For example, in JULIET, the MAE of standard parametric models was between 0.027 and 0.132 for the first DBL and between 0.023 and 0.114 for the third DBL (Fig. 2, ESM Tables S1 and S3). A similar pattern was observed in ZUMA-1 and TRANSCEND (ESM Figs. S35 and S36). The external validation showed that 48%, 24%, and 7% of all standard parametric models fitted to the first, second, and third DBL, respectively, across all trials, projected a 5-year survival rate lower than that observed in SCHOLAR-1, indicating that these models did not pass the external validation checks (Table 2 and ESM Tables S1‒S7). In particular, the exponential, Weibull, and gamma distributions performed poorly in the external validation. For example, in the first, second, and third DBLs of the JULIET trial, the exponential distribution estimated a 5-year survival rate of 3%, 4%, and 12%, respectively, all of which fall below the 16% 5-year survival observed in SCHOLAR-1 and making their use clinically and scientifically implausible.

Fig. 2
figure 2

MAE of the different models fitted to the first, second, or third DBL of JULIET. Lower MAE scores indicate better predictive accuracy. DBL database lock, MAE mean absolute error, SPM standard parametric models

3.2 Performance of Mixture Cure, Cubic Spline, and Mixture Models

The mixture cure models had MAEs across all DBLs of 0.010‒0.050, 0.013‒0.085, and 0.015‒0.024 for JULIET, ZUMA-1, and TRANSCEND, respectively. The cubic spline models had lower MAEs compared with the standard parametric models, including ranges across all DBLs of 0.013‒0.035, 0.023‒0.128, and 0.014‒0.030 for JULIET, ZUMA-1, and TRANSCEND, respectively. Among the mixture models that converged, there was a large variation in the MAE in the first DBLs of JULIET and ZUMA-1. Even with this variation, especially in the more mature data, mixture models performed better than the standard parametric models. The MAE of the mixture models was 0.100‒0.123, 0.014‒0.176, and 0.012‒0.022 for JULIET, ZUMA-1, and TRANSCEND, respectively (Table 2 and ESM Tables S1‒S7).

With increasing data maturity, mixture cure and cubic spline models continued to have high predictive accuracy (ESM Tables S1‒S8). Mixture models showed an increase in predictive accuracy with increasing length of follow-up. For example, the mixture models resulted in an MAE of 0.025‒0.176 in the first ZUMA-1 DBL, which decreased (i.e., improved) to 0.014‒0.050 and 0.029‒0.044 in the second and third DBLs, respectively (ESM Fig. S35 and ESM Tables S4‒S6). All mixture cure and cubic spline models generated 5-year survival estimates that exceeded SCHOLAR-1. For the mixture models, 24%, 2%, and 6% of the models did not meet this requirement for JULIET, ZUMA-1, and TRANSCEND, respectively.

Spline models fitted to the earliest DBLs estimated lower 5-year survival than mixture cure models (ESM Figs. S37‒S39). However, the slope in the tail of the spline models became very shallow, particularly with more mature data, nearly converging to mixture cure models for the period for which data were available (Fig. 3 and ESM Figs. S37‒S39). This trend was also observed in the minimum clinical validation threshold 5-year survival estimates of the mixture cure and cubic spline models. The mixture cure models showed little change in mean 5-year survival estimates with increasing data maturity, but the differences between the 5-year survival estimates of the different mixture cure models decreased (Table 2 and ESM Tables S1‒S7). In contrast, the 5-year survival estimates of the cubic spline models gradually increased as data maturity increased. For example, in ZUMA-1, 5-year survival estimates for the first, second, and third DBL were 25‒55%, 47‒51%, and 47‒51%, respectively, for mixture cure models, and 16‒34%, 34‒45%, and 43‒47%, respectively, for cubic spline models (ESM Tables S4‒S6). Despite this convergence, the mean lifetime survival estimated by mixture cure and cubic spline models differed substantially. For example, for models fitted to the third DBL of ZUMA-1 (Fig. 3), mean lifetime survival estimates ranged from 19.7 to 21.3 years for mixture cure models and from 12.0 to 18.2 years for cubic spline models (ESM Table S6).

Fig. 3
figure 3

Comparison of the cubic spline models and the mixture cure models fitted to the third DBL of ZUMA-1. DBL database lock

3.3 Correlation Between Goodness of Fit and Predictive Accuracy

Model selection using GOF criteria did not maximize predictive accuracy, especially for early DBLs. This was demonstrated by the limited correlation observed between AIC/BIC and MAE (Fig. 4 and ESM Figs. S40‒S47). Similar results were obtained when comparing AIC/BIC values with RMSE (Table 2 and ESM Tables S1‒S7). For example, for the first DBL of TRANSCEND, the three models with the lowest BIC that passed external validation checks were the exponential distribution, exponential mixture cure model, and log-logistic mixture cure model. In contrast, the models with the lowest MAE were the generalized gamma/log-logistic, generalized gamma/gamma, and Gompertz/Weibull mixture models (Table 2). Among the models that passed external validation checks, mean lifetime survival estimates for the models with the lowest BIC were 7.6, 4.2, and 17.0 years, and 8.7, 16.7, and 15.5 years for the models with the lowest MAE, for the first DBLs of JULIET, ZUMA-1, and TRANSCEND, respectively (ESM Tables S1 and S4, and Table 2). This demonstrates that GOF criteria-based model selection can result in selecting models that substantially overestimate or underestimate mean lifetime survival.

Fig. 4
figure 4

Correlation between MAE and BIC in the models fitted to the first DBL of TRANSCEND NHL 001. BIC Bayesian information criterion, DBL database lock, MAE mean absolute error, SPM standard parametric models

4 Discussion

Our analyses demonstrated that mixture cure and cubic spline models can achieve a higher predictive accuracy for OS of CAR T-cell therapies in R/R LBCL compared with standard parametric and mixture models, even when data are relatively less mature. Furthermore, GOF criteria demonstrated a low correlation with predictive accuracy, suggesting that other criteria should be prioritized for model selection. Approximately one-third of the mixture models did not converge, and among the models that did converge, there was much variation in predictive accuracy for early DBLs. Both the nonconvergence and the variance in predictive accuracy are likely a result of the low number of patients and the low number of events, suggesting that mixture models may not be suitable for survival extrapolations of early DBLs. Cubic spline models, especially when fitted to more mature data, had a hazard almost identical to that of the general population for the period beyond the trial follow-up, behavior typically seen in cure models. One limitation of cubic spline models may be the subjectivity of the number and placement of the knots. In this study, we evaluated cubic spline models with one or two intermediate knots, where the knots were equally distributed over uncensored log time. As durable remission can be achieved in R/R LBCL, using mixture cure models is clinically justified, and therefore may be a more appropriate survival extrapolation method for CAR T-cell therapies in R/R LBCL. It has been suggested that long-term survivors in LBCL have minor excess mortality [32]. In the NICE submissions of tisagenlecleucel, a standardized mortality ratio of 1.09 was applied to background mortality to model survival after the cure point to reflect this [15, 20]. Although a relative survival framework was adopted, our model estimations did not account for excess mortality for long-term survivors, which is in line with the updated submission of axi-cel [14]. Incorporating a standardized mortality ratio of 1.09 would only have a minor impact on the overall lifetime survival estimates and it would not affect relative model performance, as the adjustments would apply equally to all models.

One major strength of our study is that we compared several extrapolation methods for multiple DBLs from three different trials of CAR T-cell therapies in R/R LBCL, allowing us to study to what extent the optimal extrapolation method differs between trials and DBLs in a similar clinical setting (i.e., cellular therapies in the same indication). However, there are two limitations of note in our study. First, we did not consider progression-free survival (PFS). Survival extrapolations on OS may underestimate the cure fraction owing to the lack of maturity of the OS data. In practice, a plateau in PFS is reached earlier than for OS, which can provide information about the cure fraction. However, in JULIET and TRANSCEND, disease progression was determined by an independent review committee, whereas in ZUMA-1, disease progression was determined by investigator assessment. This difference in disease progression assessment creates bias when analyzing PFS data across the three trials, which is why we did not consider PFS in this study. Second, even though we used the most mature data available at the time these analyses were conducted, data maturity was still limited, with minimum follow-up times of 26, 47, and 18 months for JULIET, ZUMA-1, and TRANSCEND, respectively. Since the time of these data analyses, longer follow-up data for ZUMA-1 have been presented [33]. When more mature data for all three trials become available, stronger conclusions about predictive accuracy may be drawn than were drawn based on the current study.

In a recent study, Vadgama et al. [34] compared the performance of standard parametric models, mixture cure models, cubic spline models, and nonmixture cure models fitted to the first, second, and third DBLs of ZUMA-1. Our study is more comprehensive, as it uses data from JULIET, ZUMA-1, and TRANSCEND and also includes mixture models. Furthermore, Vadgama et al. [34] used a different measure to evaluate the performance of the survival extrapolation methods, as they compared predicted survival at 48 months with 48-month survival observed in the most mature data. This method can be used to evaluate the difference between the predicted and observed survival at one specific time point, whereas the MAE and RMSE used in this study account for the difference in the predicted and observed survival over the entire period for which data are available (e.g., from receiving CAR T-cell therapy to the date of the last event in the most mature data). Despite the differences between this study and the study by Vadgama et al. [34], both studies concluded that cure-based models provided the most accurate extrapolations of long-term survival for patients with R/R LBCL treated with CAR T-cell therapies, which is also in line with studies of other immuno-oncology therapies [35, 36]. Additionally, in the final NICE appraisal of axi-cel, the independent academic reviewer group agreed that mixture cure models estimated on the 60-month OS data of ZUMA-1 provided good fits and were rightfully selected to model OS.

5 Conclusions

Standard parametric models are inappropriate for extrapolation of survival data for CAR T-cell therapies in LBCL. Mixture cure and spline models provided the best predictive accuracy, but mixture cure models may be more appropriate given the clinical relevance of their underlying assumptions to LBCL. Furthermore, due to the reduced correlation between GOF criteria and predictive accuracy observed in our analysis, limited weight should be assigned to GOF criteria when selecting the optimal survival model for endpoints with complex hazard functions, particularly when this is due to a subset of patients achieving long-term remission and data are immature. Although this study focused on CAR T-cell therapies in patients with R/R LBCL who received two or more prior systemic treatments, we postulate that the results of the current study may also be applicable to earlier treatment lines and other oncology indications in which lasting remission is observed, as the hazard functions are expected to be similar.