Inadequacy of existing clinical prediction models for predicting mortality after transcatheter aortic valve implantation

Background The performance of emerging transcatheter aortic valve implantation (TAVI) clinical prediction models (CPMs) in national TAVI cohorts distinct from those where they have been derived is unknown. This study aimed to investigate the performance of the German Aortic Valve, FRANCE‐2, OBSERVANT and American College of Cardiology (ACC) TAVI CPMs compared with the performance of historic cardiac CPMs such as the EuroSCORE and STS‐PROM, in a large national TAVI registry. Methods The calibration and discrimination of each CPM were analyzed in 6676 patients from the UK TAVI registry, as a whole cohort and across several subgroups. Strata included gender, diabetes status, access route, and valve type. Furthermore, the amount of agreement in risk classification between each of the considered CPMs was analyzed at an individual patient level. Results The observed 30‐day mortality rate was 5.4%. In the whole cohort, the majority of CPMs over‐estimated the risk of 30‐day mortality, although the mean ACC score (5.2%) approximately matched the observed mortality rate. The areas under ROC curve were between 0.57 for OBSERVANT and 0.64 for ACC. Risk classification agreement was low across all models, with Fleiss's kappa values between 0.17 and 0.50. Conclusions Although the FRANCE‐2 and ACC models outperformed all other CPMs, the performance of current TAVI‐CPMs was low when applied to an independent cohort of TAVI patients. Hence, TAVI specific CPMs need to be derived outside populations previously used for model derivation, either by adapting existing CPMs or developing new risk scores in large national registries.


Introduction
Despite surgical aortic valve replacement (SAVR) being the definitive treatment strategy for severe symptomatic Aortic Stenosis (AS), a significant proportion of patients are not offered surgery due to co-morbidities or frailty that contribute to high surgical risks and adverse outcomes in such patient groups (1). Transcatheter aortic valve implantation (TAVI) has emerged as an efficacious but less invasive treatment option in high and intermediate operative risk patients (2)(3)(4)(5). As such, treatment allocation between medical management, SAVR and TAVI depends on multiple factors, but key is the assessment of the patient's procedural risk. Clinical prediction models (CPMs), which quantify the risks associated with the proposed treatment strategy at an individual patient level, can aid heart-teams in this clinical decision-making process and are vital for audit purposes between TAVI centres.
Cardiac surgery CPMs for short-term mortality prediction, such as the European System for Cardiac Operative Risk Evaluation Score (EuroSCORE) (6,7) and the Society of Thoracic Surgeons Predicted Risk of Mortality (STS) model (8), have been used to identify high-risk patients in randomised trials of TAVI (2,3). However, these surgical CPMs perform poorly in predicting risk after both SAVR and TAVI, as exemplified in the Placement of Aortic Transcatheter Valves (PARTNER) cohort A trial where there was large disagreement between the observed and STS-expected 30-day mortality (3). Moreover, several cohort studies have shown the inaccuracy of the surgical CPMs in predicting mortality after TAVI (9)(10)(11).
Consequently, TAVI specific CPMs are beginning to emerge from large cohorts of TAVI patients (12)(13)(14)(15). In particular, the German Aortic Valve Score (German AV) was developed using patients who underwent either surgical replacement or TAVI (13), while TAVI-specific CPMs have been derived in the France TAVI registry (FRANCE-2 model) (14), the Italian TAVI registry (OBSERVANT model) (12) and the Society of Thoracic Surgeons/ American College of Cardiology Transcatheter Valve Therapy registry (ACC model) (15). However, the performance of the aforementioned TAVI-CPMs in large cohorts of patients outside of their derivation cohorts is unknown. Hence, it is unclear if they can be reliably used in other national settings.
Therefore, the aim of this study was to investigate the performance and agreement of the German AV, FRANCE-2, OBSERVANT and ACC TAVI-CPMs for predicting 30-day mortality outside their development cohorts, to examine if the performance was sufficient for them to be used for this purpose. The study compared the TAVI-CPM performance against surgical CPMs, namely the Logistic EuroSCORE (LES), EuroSCORE II (ESII) and STS score.

UK TAVI Registry
Prospectively collected data on every TAVI procedure in the United Kingdom from January 2007 to December 2014 were obtained through the UK TAVI registry (16). By the end of 2014, 34 UK centres were performing TAVI procedures with multi-disciplinary teams of cardiologists, surgeons and other health-care professionals at each centre deciding on patients' suitability for TAVI (16). The web-based registry comprises 95 variables detailing patient demographics, risk factors for intervention, procedural details and adverse outcomes up to the time of hospital discharge. All-cause mortality tracking was obtained from the Office for National Statistics (ONS) providing the life-status of English and Welsh patients (two countries of the UK). Mortality tracking was unavailable for patients in Northern Ireland and Scotland and as such, these patients were removed from the analysis.

Statistical Analysis
Multiple imputation was used for missing values, with ten datasets imputed (17).
Missing life-status was not imputed and this analysis excluded any patient who had such a missing endpoint. To avoid underestimation of covariate-outcome associations, 30-day mortality indication was used in the imputation models for missing covariates (18). Further details of the imputation procedure are given in the supplementary material.
The risk of 30-day mortality implied by each CPM was retrospectively calculated for each patient based on the published regression coefficients (6)(7)(8)(12)(13)(14)(15). This analysis used clinical reasoning to make assumptions regarding translation between variable definitions in the published CPMs and those in the UK TAVI dataset. For any CPM risk-prediction variable that was not recorded in the UK TAVI registry, the risk factor was assumed to be absent for all patients. The full translation between each CPM and the TAVI registry variables is given in the Supplementary Material.
The performance of each CPM was assessed in terms of calibration and discrimination. Calibration is the agreement between the expected and observed event rates across the full risk range; discrimination is the ability of the CPM to distinguish between those who will experience an event and those who will not. Discrimination of the risk models was analysed using the area under the receiver operating characteristic (ROC) curve, which takes values between 0.5 and 1, where higher values indicate better discrimination. To examine the calibration of each CPM, a logistic regression model was fitted with the event indicator as the outcome and the linear predictor from the CPM as the only covariate (19). Perfect calibration would occur when the corresponding intercept and slope are zero and one respectively, with the intercept estimated assuming a slope of unity. Furthermore, the Brier score was used as a measure of overall performance, with values between 0 (perfect prediction) and 1 (worst prediction) (20); McFadden's pseudo-R² was calculated to give an indication of the explained variation in the data.
CPM performance was analysed in the whole cohort and within several subgroups. The following subgroups were considered: age (≤ 75 or > 75 years), sex, diabetes status, access route (transfemoral vs. non-transfemoral), valve type (SAPIEN™ vs. CoreValve™), previous coronary artery bypass graft (CABG) status, LV function (LVEF < 50% or LVEF ≥ 50%) and procedure urgency (elective vs. non-elective).
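The performance measures described above can be illustrated with a minimal sketch. The analysis in this study was carried out in R; the Python below is only an illustration using the standard library, and the function names (`auc`, `brier`, `fit_logistic`) and the toy data in the usage note are assumptions introduced here, not part of the study code.

```python
import math

def auc(risks, events):
    """Area under the ROC curve: probability that a patient with the event
    has a higher predicted risk than one without (ties count as 0.5)."""
    pos = [r for r, e in zip(risks, events) if e == 1]
    neg = [r for r, e in zip(risks, events) if e == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(risks, events):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((r - e) ** 2 for r, e in zip(risks, events)) / len(risks)

def logit(p):
    return math.log(p / (1 - p))

def fit_logistic(lp, events, fit_slope=True, iters=50):
    """Newton-Raphson fit of logit P(event) = a + b * lp.
    With fit_slope=False the slope is fixed at one (lp acts as an offset),
    which gives the calibration intercept; with fit_slope=True the returned
    b is the calibration slope. Assumes lp is not constant when fit_slope=True."""
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(lp, events):
            p = 1 / (1 + math.exp(-(a + b * x)))
            w = p * (1 - p)
            g0 += y - p          # gradient with respect to a
            g1 += (y - p) * x    # gradient with respect to b
            h00 += w
            h01 += w * x
            h11 += w * x * x
        if fit_slope:
            det = h00 * h11 - h01 * h01
            a += (h11 * g0 - h01 * g1) / det
            b += (h00 * g1 - h01 * g0) / det
        else:
            a += g0 / h00
    return a, b
```

As a usage example on invented data, a model that predicts 20% risk for ten patients of whom one dies gives `fit_logistic([logit(0.2)] * 10, [1] + [0] * 9, fit_slope=False)` with a negative intercept, reflecting over-prediction of risk.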
Patient-level risk agreement between CPMs was analysed in the surgical models and the TAVI models separately to facilitate fair comparisons. It was decided, a priori, to derive cut-off values for each CPM that defined three risk levels (low-, medium- and high-risk), with approximately equal numbers of patients in each. The proportion of patients for whom risk classification agreed between multiple CPMs was then calculated.
Additionally, Fleiss's Kappa was calculated in the surgical and TAVI models (21). A sensitivity analysis was conducted in which the risk stratifications were re-defined to give a population ratio of 1:3:1 for low-, medium- and high-risk, respectively. R version 3.3.1 (22) was used for all statistical analyses. Multiple imputation of the dataset was completed using the mice package (23), graphical plots were made using the ggplot2 package (24) and the package pROC was used for constructing ROC curves (25).
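The agreement analysis described above can be sketched as follows. This is an illustrative Python outline rather than the R code used in the study; the helper names (`cutoffs`, `classify`, `fleiss_kappa`) are hypothetical, and the way cut-offs are taken from the sorted scores is one simple assumption about how approximately equal groups could be formed.

```python
def cutoffs(scores, ratio=(1, 1, 1)):
    """Cut-off values splitting patients into groups in the given ratio,
    e.g. (1, 1, 1) for roughly equal thirds or (1, 3, 1) for the
    sensitivity analysis."""
    s = sorted(scores)
    total, cum, cuts = sum(ratio), 0, []
    for part in ratio[:-1]:
        cum += part
        cuts.append(s[min(len(s) - 1, len(s) * cum // total)])
    return cuts

def classify(score, cuts):
    """Risk level of one patient: 0 = low, 1 = medium, 2 = high."""
    return sum(score >= c for c in cuts)

def fleiss_kappa(ratings, n_categories=3):
    """Fleiss's kappa; `ratings` holds one list per patient containing the
    risk level assigned to that patient by each CPM."""
    n_subjects, n_raters = len(ratings), len(ratings[0])
    totals = [0] * n_categories
    p_bar = 0.0
    for row in ratings:
        counts = [row.count(k) for k in range(n_categories)]
        for k, c in enumerate(counts):
            totals[k] += c
        # Proportion of agreeing CPM pairs for this patient
        p_bar += (sum(c * c for c in counts) - n_raters) / (n_raters * (n_raters - 1))
    p_bar /= n_subjects
    # Agreement expected by chance from the overall category proportions
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

For each CPM, `cutoffs` would be applied to that model's predicted risks, each patient classified with `classify`, and the resulting per-patient lists of risk levels passed to `fleiss_kappa`.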
The Health e-Research Centre, funded by the Medical Research Council [MR/K006665/1] and the North Staffordshire Heart Committee supported this work. The authors are solely responsible for the design and conduct of this study, all study analyses, the drafting and editing of the paper and its final contents.

Results
The UK TAVI registry included all 7431 patients who underwent a TAVI procedure between January 2007 and December 2014. All patients from Northern Ireland (n=400) and the majority of Scottish patients (n=193) were excluded from the analysis due to absence of ONS mortality tracking. Out of the remaining 6838 patients, a further 162 were removed due to missing life status, leaving 6676 patients studied in this analysis. The observed survival rates were 94.6%, 83.3% and 64.4% at 30 days, one year and three years of follow-up, respectively. Median follow-up time was 1.9 years (inter-quartile range 0.95-3.3 years). Table 1 presents summary statistics for baseline characteristics of the patients in the UK TAVI registry.

Performance Analysis
From January 2007 to December 2014, there were 360 deaths within 30 days of the TAVI procedure (5.4%). The expected 30-day mortalities in the whole cohort were 21.9%, 8.1%, 5.1%, 7.4%, 9.2%, 7.1% and 5.2% from the LES, ESII, STS, German AV, FRANCE-2, OBSERVANT and ACC CPMs, respectively (Table 2). The ACC score and STS model were the closest to the observed mortality in terms of absolute and relative differences, while the LES overestimated risk by a factor of four (Table 2, Supplementary Table 8). Table 3 shows the performance of each CPM in the whole cohort. While the calibration intercepts of the ACC and STS models were not significantly different from zero (i.e. the observed and expected mortalities agreed), the 95% confidence intervals for the calibration slopes did not span one, indicating model miscalibration. Poor discrimination was observed, with areas under the ROC curve between 0.57 and 0.64 for the whole cohort; the FRANCE-2 TAVI score and the ACC TAVI score had the highest AUC values of 0.62 and 0.64, respectively. Overall performance, as measured by the Brier score, was similar for the majority of models with values of 0.05; a Brier score of 0.09 for the LES was the highest (worst) amongst the models. Quantitatively similar results were obtained from a sensitivity analysis that excluded patients who underwent TAVI in 2007 or 2008 (n=337), where the observed mortality was elevated over that in subsequent years (Supplementary Table 9).
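The comparison of expected and observed mortality above is simple arithmetic on the reported figures, which the following Python sketch reproduces. The numbers are those quoted in the text; the variable names are illustrative only.

```python
deaths, patients = 360, 6676
observed = 100 * deaths / patients  # observed 30-day mortality, in percent

# Mean expected 30-day mortality (%) from each CPM, as reported in the text
expected = {
    "LES": 21.9, "ESII": 8.1, "STS": 5.1, "German AV": 7.4,
    "FRANCE-2": 9.2, "OBSERVANT": 7.1, "ACC": 5.2,
}

# Relative (expected/observed) and absolute (expected - observed) differences
ratios = {name: value / observed for name, value in expected.items()}
differences = {name: value - observed for name, value in expected.items()}

# CPMs ordered from closest to furthest from the observed mortality
closest = sorted(expected, key=lambda name: abs(differences[name]))

for name in closest:
    print(f"{name}: E/O = {ratios[name]:.2f}, difference = {differences[name]:+.1f} points")
```

An expected-to-observed ratio above one indicates over-prediction of risk; on these figures the ratio for the LES is approximately four, consistent with the statement above.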
The performance of each CPM in each subgroup is given in the supplementary material (Supplementary Table 10). The expected mortality from the ACC TAVI model was not significantly different from the observed mortality across all strata, but satisfactory calibration (calibration intercept and slope close to zero and one, respectively) was only observed for this CPM in female and diabetic subgroups. All other models were miscalibrated across strata.
The area under the ROC curve was below 0.7 for all CPMs across the subgroups, with the majority close to 0.6; the ACC and FRANCE-2 CPMs had the highest discrimination across subgroups.

Agreement Analysis
Cut-off values were selected for all CPMs that gave approximately equal numbers of patients in low-, medium- and high-risk categories; the chosen values are given in Table 4. Based on these cut-off values, the proportions of patients classified in each risk level who were similarly classified across the other CPMs were calculated (Figure 2 for the surgical based CPMs and Figure 3 for the TAVI based CPMs). A low level of agreement at an individual patient level was observed; for example, only 31.8% of the 1951 patients grouped as high-risk by FRANCE-2 (>10%) were also grouped as high-risk by the OBSERVANT and ACC models (Figure 3). Quantifying agreement between the CPMs using Fleiss's Kappa (κ) highlighted that agreement between all the surgical scores was moderate (κ = 0.37), while that between all the TAVI models was poor (κ = 0.20). The pairwise Fleiss's Kappa values are given in Table 4, which shows that there was moderate agreement between the FRANCE-2 and ACC TAVI models (κ = 0.33). In the sensitivity analysis, risk stratifications were redefined to give a population ratio of approximately 1:3:1 for low-, medium- and high-risk. Here, the results indicated marginally improved levels of agreement, but these were still moderate. Specifically, the Fleiss's Kappa across the surgical scores was 0.40 and that between the TAVI models was 0.20, with pairwise Fleiss's Kappa values given in Supplementary Table 11.

Discussion
Clinical prediction models form the cornerstone of risk stratification for patients undergoing invasive procedures, helping to guide both treatment allocation and the consent process. However, their performance needs to be tested in large datasets independent to those in which the models were developed before they can be used in external populations (26,27).
Our analysis of the UK TAVI registry has systematically demonstrated that outside their development cohorts, the German AV, FRANCE-2, OBSERVANT and ACC TAVI CPMs are miscalibrated and have low discrimination at predicting 30-day mortality. These results support previous work in this area (28). In the current study, the FRANCE-2 and ACC models had the highest discrimination of all those considered, with these comparing favourably to the internal validation results reported when these models were derived (14,15). Additionally, although the ACC model was miscalibrated, the expected mortality was not significantly different from the observed mortality across all subgroups considered in this analysis. However, of note is that the ACC model was predominantly developed to predict in-hospital mortality, which potentially contributes to the agreement between the observed and expected event rates for this model.

The performance of any CPM is expected to drop when it is applied in populations external to the development set, since patient mix and procedure techniques can vary between populations (26,27,29). Consequently, the findings of the current study are, perhaps, unsurprising given that the TAVI-CPMs achieved only moderate performance in their respective development datasets (12,14,15). Current TAVI cohorts predominantly represent a particularly high-risk and homogeneous group of patients, potentially contributing to the lack of a highly predictive TAVI-CPM. Future TAVI-CPMs need to be developed by utilising the contemporary large registries that are emerging, which will inevitably require greater harmonisation between variable and outcome definitions amongst national datasets.
Moreover, many of the co-morbidities used in the development of CPMs are cardiovascular risk factors, with important non-cardiovascular co-morbidities not considered (30). In particular, frailty is not reflected in many of the CPMs, despite being particularly prevalent in elderly patients with AS and previous work suggesting frailty to be associated with poor TAVI outcomes (31,32). A CPM that aims to predict long-term mortality following TAVI found that the inclusion of frailty in their model significantly increased the discrimination (33). Similarly, a previously published CPM that aims to predict mortality and/ or a decline in quality of life following TAVI included an indication of 6-min walk test distance (34).
The present study indicated that the 30-day mortality was elevated in 2007 and 2008 over that in subsequent years, but the sensitivity analysis that excluded 2007/08 procedures indicated similar results to the main analysis. Previous studies have shown a learning curve associated with TAVI, but centre/operator volume and outcome relationships remain debated (35)(36)(37). Nevertheless, measures of operator volume or experience are not used in CPMs since accounting for such variables would be inappropriate, particularly when the purpose of a CPM might be to benchmark the performance of an individual operator or centre. Similarly, the addition of operator volume/experience in a CPM would make it almost impossible for a physician to convey the predicted risk to a patient.

Comparison with Performance of the Surgical CPMs
The current study confirms previous work in showing that the performance of the LES, ESII and STS models is poor at predicting 30-day mortality post TAVI (9)(10)(11).
Despite being poor, the STS model outperformed the other surgical models, with the STS expected 30-day mortality rate not significantly different from the observed 30-day mortality rate. This finding has been previously observed (9,11) and is perhaps attributable to the fact the STS score has a specific model for isolated valve surgery (8). Of note, previous TAVI registries have reported mean STS values higher than that found in this study, perhaps due to the assumptions made in our study regarding the calculation of the STS model. For example, the FRANCE TAVI registry reported STS values of around 18%, while the Italian CoreValve registry reported values of 11% (38,39).
Nonetheless, comparing the surgical CPMs to the TAVI-CPMs highlights that the latter performed better than the former when internally validated (12,14) and the current study shows that the FRANCE-2 and ACC models outperformed the surgical scores. Surgical CPMs are limited in their use in transcatheter procedures because they were derived from surgical populations. Not only are the procedural risks of TAVI different from those in SAVR, but there is a lack of grading between the severities of co-morbidities in the surgical CPMs. For example, chronic obstructive pulmonary disease (COPD) is a risk factor in LES, but there is no further distinction between the severity of COPD or even other severe lung disease. Since the heart-team considers such severities when deciding between SAVR and TAVI, grading of co-morbidities should be included in future TAVI-CPMs.

Patient-level Agreement Analysis
This study highlighted that the classification of patient risk varies between multiple CPMs, even when comparing surgical and TAVI based CPMs separately. A Pearson correlation coefficient of 0.56 has previously been reported between the LES and STS score (10), with similar correlation between these scores reported in other studies (11). Such an analysis does not necessarily indicate the level of agreement between two risk models, since the correlation only assesses the linear relationship between them (40). Although the current study found higher agreement between the surgical models than between the TAVI models, this was driven by the ESII being an updated version of the LES. The lack of agreement between the scores reinforces previously published recommendations that risk assessment should be based on heart-team discussion in combination with multiple CPMs (4).
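The distinction between correlation and agreement can be made concrete with a small sketch. The risk values below are invented purely for illustration and do not come from the registry; `pearson` is a hypothetical helper written with the Python standard library.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predicted risks from two models for five patients:
# model B always predicts exactly twice the risk of model A.
model_a = [0.02, 0.05, 0.08, 0.12, 0.20]
model_b = [2 * risk for risk in model_a]

r = pearson(model_a, model_b)
mean_abs_diff = sum(abs(a - b) for a, b in zip(model_a, model_b)) / len(model_a)
```

Here the two sets of predictions are perfectly linearly related, yet the absolute risks they assign to each patient differ substantially, so a high correlation alone would overstate how well the models agree.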

Limitations
A limitation of the current work is that assumptions were required when linking the definitions of model variables with the TAVI dataset, as described in the Supplementary Material. For example, the lowest LV function category in the ESII model is LVEF<20% whereas that in the UK TAVI dataset is LVEF<30%, with this analysis assuming these definitions to be equivalent. Such assumptions are an artefact of different recording practices between national registries. Accordingly, some of the surgical CPMs could not be calculated exactly as they were published, which could induce bias into the calculated predicted risks.
Surrogate variables were used to mitigate this wherever possible and all assumptions were made to reflect the TAVI procedure as accurately as possible. As noted above, the calculated STS score in this study is lower than previously reported values from other TAVI registries. Lack of mitral valve, hypertension and severity of pulmonary disease variables could have contributed to this, but our findings compare favourably to previous work. Similarly, the assumption that a risk factor was absent for variables that were included in CPMs but not recorded in the UK TAVI registry (e.g. mitral valve replacement or infective endocarditis) may induce bias, but any such bias is likely to be negligible given the variables where this assumption was needed.

Implications for Future Work
Based on this work, the development of further TAVI-CPMs is recommended in populations of interest. Although there is an indication of feasibility of TAVI in intermediate risk patients (5), TAVI-CPMs are still required, especially for procedure audit purposes and risk stratification analyses. Rather than developing new scores from scratch, model updating techniques could be applied to the current TAVI-CPMs to adapt them to new national cohorts (41). For instance, re-fitting the current models to the population of interest and/or the addition of new risk factors, such as frailty, could improve prediction (31,42). Further work in this area is recommended. In addition, developing TAVI models that predict both short- and long-term outcomes would be particularly valuable, especially if they included a measure of futility.
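One of the simplest forms of model updating is to re-estimate only the intercept of an existing CPM in the new population, leaving its coefficients unchanged. The Python sketch below illustrates that idea under stated assumptions; it is not a method applied in this study, and `recalibrate_intercept` and the toy data in the usage note are hypothetical.

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

def recalibrate_intercept(risks, events, iters=25):
    """Shift an existing model's predictions on the logit scale by a single
    constant `a`, chosen by maximum likelihood (Newton-Raphson), so that the
    mean updated risk matches the observed event rate."""
    lp = [math.log(r / (1 - r)) for r in risks]  # original linear predictor
    a = 0.0
    for _ in range(iters):
        p = [expit(a + x) for x in lp]
        grad = sum(y - q for y, q in zip(events, p))
        hess = sum(q * (1 - q) for q in p)
        a += grad / hess
    return a, [expit(a + x) for x in lp]
```

For example, if an existing model predicts 20% risk for each of ten patients and one of them dies, `recalibrate_intercept([0.2] * 10, [1] + [0] * 9)` returns a negative shift and updated risks of about 10%. More extensive updating, such as re-estimating the slope or adding new predictors, would follow the same principle with more parameters.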

Conclusions
The FRANCE-2 and ACC TAVI models had the highest performance across all CPMs considered. However, all the CPMs had low calibration and discrimination, reducing their suitability for risk stratification outside their development cohorts. Future iterations of existing TAVI models may benefit from including non-cardiovascular co-morbidities such as frailty. The derivation of new TAVI-CPMs in contemporary large registries is recommended, but it remains to be determined if this is best achieved by updating/revising existing TAVI scores, by developing new CPMs in specific cohorts, or a combination of the two.
Figure caption (agreement between TAVI-CPMs): Each bar represents a risk stratification by one of the TAVI-CPMs, with the segments of that bar showing the proportion of patients who were also grouped in that risk stratum by none, one or both of the other TAVI-CPMs.