SMART choice (knee) tool: a patient‐focused predictive model to predict improvement in health‐related quality of life after total knee arthroplasty

Current predictive tools for TKA focus on clinicians rather than patients as the intended user. The purpose of this study was to develop a patient‐focused model to predict health‐related quality of life outcomes at 1‐year post‐TKA.


Introduction
Publications on predictive models in the orthopaedic literature have grown exponentially in the last decade. 1 In part, this may be due to the increasing availability of large registries and datasets that are needed to produce predictive models. 2 However, more choices of predictive models may not necessarily translate to more benefits for patients. With over 20 major predictive models for total knee arthroplasty (TKA) described in the literature, [3][4][5][6][7][8][9][10][11][12][13][14] the apparent overlapping functions of these models may lead to confusion for both clinicians and patients. 15 In 2021, Farrow et al. published an annotation describing how predictive models in orthopaedics should be reported by authors and subsequently interpreted by clinicians. 16 From this publication, it is clear that a clinical gap needs to be identified before a predictive model is developed. This is in contrast with earlier research where machine learning models were developed as 'proofs-of-concept' rather than proven solutions to clinical problems. 17 Based on advances in predictive modelling, the paradigm has now shifted from 'how do we develop predictive tools' towards 'how do we develop predictive tools that can be implemented into clinical practice'. 17,18 In addition, consideration of who the intended user of a predictive tool will be is important. The literature remains skewed towards predictive tools intended for use by clinicians rather than patients. 1 This finding is a distinct divergence from the widely adopted shared decision-making model of modern healthcare. 19 If most predictive tools are developed solely for clinician use, how can the implicit biases of clinicians be separated from patients' informed decisions about their health? Consequently, there remains a need to develop predictive tools for patients to use early in their TKA journey, before seeing a surgeon.
Furthermore, the evaluation of predictive tools requires more than statistical validation. 20 In the context of arthroplasty, most predictive tools are never evaluated in a clinical environment yet are promoted based on statistical metrics alone. 21 Without data from clinical trials, it would be difficult to be certain of the effects that predictive tools can have on patient care and outcomes. 20,22 In response to these considerations, we have developed the SMART Choice (Knee) tool. This is a predictive tool that aims to predict health-related quality of life (HRQoL) outcomes for patients considering TKA. The tool can be used by patients before seeing a clinician. The intention of the tool is not to replace clinician decision making, but rather to improve patient decision making through the dissemination of knowledge. The rationale for developing this tool stems from previous work noting that one of the most significant predictors of post-TKA dissatisfaction is pre-TKA symptom state. 23,24 In other words, patients with milder pre-surgery symptoms are less likely to be satisfied after surgery. Therefore, when using predictive tools like SMART Choice, the decision for patients rests on whether to undergo TKA in their current symptom state or delay their surgery whilst waiting for their knee symptoms to worsen (and the probability of success after surgery to improve).
With these factors in mind, we hypothesize that use of the SMART Choice tool may influence patients to make more informed decisions about undergoing surgery and, for some patients, change their willingness to undergo surgery altogether. Additionally, we expect that our predictive tool may help to align patient expectations of surgery towards a more realistic outcome. This may in turn reduce patient dissatisfaction through the alignment of patient expected outcomes with actual outcomes after TKA. 25,26 To test these hypotheses, we have developed this tool to be tested in subsequent clinical trials. 27

Methods
Ethics approval and governance were received from the St. Vincent's Hospital Ethics Committee (HREC285/21) and The University of Melbourne Ethics Committee (2021-23157-24498-3). The study is reported in concordance with the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines. 28

Data source
The study used prospectively collected data from the SMART registry between 1 January 2006 and 1 January 2019. 29 This registry captures all hip and knee arthroplasty procedures from a tertiary institution in Melbourne, Australia. Patient-reported outcome measures (PROMs) were captured using the Veterans-RAND 12 (VR12) and Western Ontario and McMaster Universities Arthritis Index (WOMAC) scores. The Registry has a reported 98% follow-up rate at 1 year post-surgery with a 0.1% patient opt-out rate for participation in PROM data. 29

Eligibility criteria
All patients from the SMART registry who underwent primary TKA for osteoarthritis were considered for model development. 30 Revision TKA and primary unicondylar, patellofemoral and bicompartmental arthroplasties were excluded. Additionally, patients with inflammatory arthritis, rheumatoid arthritis, or fracture as an indication for surgery were excluded.

Outcome measures
The primary predicted outcome was HRQoL parameters at 12 months after TKA. This was measured as a change in utility score calculated from VR12 responses. Standardized Brazier's method was used to convert VR12 responses into a utility score. 31 The utility score measures preference values that patients attach to their overall health status. 32 A value of 0 is a health state equivalent to death, whereas a value of 1 is equal to perfect health. Minimal clinically important difference (MCID) values were calculated from utility score studies based on the United Kingdom population. 33 There were no suitable MCID values for the VR12 utility score calculated using an Australian population. From the United Kingdom population, the MCID for improvement after TKA was determined to be an increase in utility score of 0.09 at 6 months post-surgery. Previous studies have indicated no significant difference in utility score between 6 and 12 months post-TKA. 34 Utility score changes greater than or equal to 0.09 were classified as 'improvement', and those less than 0.09 were classified as 'no improvement' for our models.
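As a minimal sketch of the MCID classification described above (the function and variable names are illustrative, not the authors' code), the rule reduces to a single threshold comparison on the utility-score change:

```python
MCID = 0.09  # minimal clinically important difference, derived from a UK population

def classify_improvement(pre_utility: float, post_utility: float) -> str:
    """Label a patient 'improvement' if the utility-score gain meets the MCID.

    Utility scores range from 0 (health state equivalent to death)
    to 1 (perfect health).
    """
    change = post_utility - pre_utility
    return "improvement" if change >= MCID else "no improvement"
```

For example, a patient moving from a utility score of 0.55 to 0.70 (a gain of 0.15) would be labelled 'improvement', whereas a gain of 0.05 would not.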

Measure of model performance: discrimination
Discrimination is the ability of a predictive model to distinguish between patients with and without the outcome. For classification predictive models, the concordance (c) statistic is the most commonly used measure of performance. 35 For binary outcomes, the c-statistic is equivalent to the area under the receiver operating characteristic (ROC) curve, also termed more simply the area under the curve (AUC). The ROC curve plots the true positive rate against the false positive rate (1 − specificity) at multiple consecutive cut-offs for the probability of an outcome. 20 Additional metrics included in the analysis were sensitivity, specificity, positive predictive value, and negative predictive value. 36
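For intuition, the c-statistic can be computed directly as the proportion of (improved, not-improved) patient pairs in which the improved patient received the higher predicted probability, with ties counting as half. A minimal pure-Python sketch (illustrative only, not the authors' implementation):

```python
from itertools import product

def c_statistic(y_true, y_prob):
    """Concordance statistic for a binary outcome.

    Fraction of (positive, negative) pairs where the positive case
    received the higher predicted probability; ties count as 0.5.
    Equivalent to the area under the ROC curve (AUC).
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))
```

A model that ranks every improved patient above every non-improved patient scores 1.0; a model no better than chance scores 0.5.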

Measure of model performance: calibration
Calibration is a measure of agreement between observed outcomes and predictions. 37 Three metrics were used to measure calibration from differing perspectives. The calibration intercept measures the average agreement between model predictions and outcome prevalence: it captures systematic over- or under-prediction of actual outcomes. 38 In contrast, the calibration slope measures the precision (spread) of the model's predictions. 20 The Brier score represents an overall measure of model performance and takes into account calibration (and, to some degree, discrimination). 39 This score was calculated as the mean squared difference between the actual outcomes and the predicted probabilities. 40
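The Brier score calculation described above can be sketched as follows (illustrative only, not the authors' code):

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between actual outcomes (coded 0/1)
    and predicted probabilities. Lower is better: 0 indicates perfect
    predictions; 0.25 matches an uninformative constant 0.5 forecast.
    """
    n = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n
```
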

Predictor variable selection
Patient-specific predictors based on clinical relevance and a literature review were considered for the model. [41][42][43][44][45][46] Potential predictors that required clinician expertise, such as radiographic findings and blood test results, were excluded. Predictors used for model development included age (continuous), gender (binary), ethnicity (categorical), body mass index (BMI) (continuous), socioeconomic indicator for areas (categorical), remoteness (categorical), indication for TKA (categorical), comorbidities (categorical), Charlson comorbidity index (categorical), VR12 responses (discrete for combined scores and categorical per question), Short-Form Six-Dimension (SF-6D) responses (categorical), and WOMAC scores (discrete for combined scores and categorical per question). Final predictors were selected based on optimal discrimination and calibration metrics: optimal discrimination was defined as the predictor combination with the highest AUC value for each model, and optimal calibration as the combination that produced a calibration intercept closest to 0 and a calibration slope closest to 1.

Missing data
Most predictors considered during model development reported less than 1% missing data (Table 1). During the development phase, model iterations were built using different missing-data strategies: dropping missing observations, k-nearest neighbour imputation, tree-bagged imputation, and multiple imputation via chained equations. The final model contained less than 2% missing data across all observations; therefore, complete case analysis was used rather than imputation. [47][48][49]

Model development and statistical analysis
The data were randomly split into training (75%) and testing (25%) datasets. The split was stratified by utility score change category (improvement or no improvement). Four predictive models were developed using different techniques: standard logistic regression, 50 standard classification tree, 51 extreme gradient (XG) boosted classification tree, 52 and random forest. 53 All models used 10-fold cross-validation with bootstrapping.
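A stratified 75/25 split of the kind described above can be sketched as follows (the record structure, field name, and seed are hypothetical, for illustration only):

```python
import random

def stratified_split(records, label_key, train_frac=0.75, seed=42):
    """Split records into train/test while preserving outcome-class
    proportions: each class is shuffled and divided separately, so the
    improvement/no-improvement ratio is the same in both partitions.
    """
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(r[label_key], []).append(r)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

With 60 'improvement' and 40 'no improvement' records, this yields a 75-record training set containing exactly 45 improvers, preserving the 60:40 ratio.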
The standard logistic regression model was constructed using a backwards stepwise elimination method. 54 This approach begins with a full model, and each step gradually removes predictors from the model to find a reduced model that best explains the data. Classification trees partition the data to visually represent the relationship between potential predictors and the outcome of interest. Information gain, Chi-square, and Gini impurity were used as tree splitting methods. 55 Both XG boosted trees and random forest are ensemble tree-based methods. XG boosted trees fitted a series of classification trees in sequential order, with each subsequent tree altered to improve prediction errors made by the previous tree. The process was repeated 1000 times to optimize the model. In comparison, random forest used bootstrapped samples to fit multiple classification trees (between 500 and 2000 trees). The use of multiple trees reduces the individual errors that can occur from single tree learning methods. 56 We also used tree-based methods to improve the performance of the logistic regression model. Bootstrapped classification trees were created to identify where continuous variables (age and BMI) were commonly split. Based on these split points, age and BMI were converted to categorical variables to capture complex relationships for the logistic regression model.

Model evaluation
The training set was used to develop the optimal model for each modelling technique. This was based on the ability to correctly predict improvement in utility score after TKA. The testing set was used to evaluate the out-of-sample model performance based on the AUC and calibration plot.

Probability decile groups
When evaluating the model using the testing set, the model generated a probability score for improvement (0-1) for each individual, with 1 being equal to a 100% probability for improvement. All individuals in the testing set were then ranked by the probability score from lowest to highest. Decile groups were created by cutting the testing set into 10 equal groups based on the ranked probability score. Predicted and actual outcomes were reported in a confusion matrix within each decile group ( Table 2). The intention is for new patients who use the model to have an individual probability score that fits into one of the 10 decile groups. Reported predictions for these new patients will be based on the actual outcomes that occurred in each decile group.
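The decile construction described above can be sketched as follows (illustrative only; the real model reports full confusion matrices per decile rather than just the observed rate):

```python
def decile_groups(y_true, y_prob):
    """Rank patients by predicted probability, cut the ranked list into
    10 equal groups, and report the observed improvement rate in each.
    Assumes at least 10 patients so every decile is non-empty.
    """
    ranked = sorted(zip(y_prob, y_true))  # ascending by probability
    n = len(ranked)
    rates = []
    for d in range(10):
        lo, hi = d * n // 10, (d + 1) * n // 10
        members = ranked[lo:hi]
        rates.append(sum(y for _, y in members) / len(members))
    return rates
```

A new patient's probability score is then matched to one of these 10 groups, and the observed outcome rate within that group is what is reported back to the patient.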

Results
A total of 3755 patients met the eligibility criteria and were included in the study cohort (Table 3).

Overall best model performance
Measures of discrimination and calibration are outlined in Table 4. The model with the best discrimination was the standard logistic regression, with an AUC of 0.712. From a calibration perspective, all models were moderately well calibrated except for the standard classification tree, which performed the poorest (intercept 0.290; slope 0.552; Brier score 0.227). Furthermore, the standard classification tree model demonstrated a significant decrease in the calibration slope for predictions with a higher probability of improvement (Fig. 1b). Calibration plots for all models are presented in Figure 1.

Best model performance by decile
Based on the discrimination and calibration performance, the standard logistic regression model was selected as the optimal predictive model (Table 5). The default threshold for the model to predict improvement or no improvement was a probability score of 0.5. However, this threshold was altered to optimize the discrimination performance of the model. Using AUC as the output metric, the optimal probability score threshold was found to be 0.602. In the highest decile, the true probability of improvement was 85%. In the lowest decile, the true probability of no improvement was 68%. Full outcomes based on decile groups are presented in Table 2.
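The study tuned its threshold using AUC as the output metric; a common alternative criterion for choosing a classification cut-off is Youden's J (sensitivity + specificity − 1), sketched below for illustration (this is not the authors' exact procedure):

```python
def best_threshold(y_true, y_prob, candidates):
    """Return the candidate probability cut-off that maximizes
    Youden's J = sensitivity + specificity - 1, one common criterion
    for tuning a binary classification threshold.
    """
    def youden(t):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        return sens + spec - 1.0
    return max(candidates, key=youden)
```
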

Discussion
When considering discrimination and calibration metrics, the best-performing model was standard logistic regression. This finding suggests that, when using patient-focused predictors, more complex machine learning algorithms may not be superior to standard logistic regression at predicting utility score improvement. Furthermore, all models performed better with respect to calibration than discrimination. The best-calibrated model in this study was also the standard logistic regression model. There are several features unique to the SMART Choice tool. First, the development of a predictive tool specifically for patient use is uncommon in the context of arthroplasty. Patient access to a patient-focused tool is essential to achieving the goal of individualized care. Although there are performance sacrifices to consider with our patient-focused tool, the overall benefit to the patient comes from the ability to make more informed decisions about their health. Patient decision-support tools have been developed in recent years using data from the United Kingdom (UK) National Joint Registry. 57 However, limited validation has occurred in non-UK populations. The SMART Choice tool will be the first to be developed from an Australian population.
Second, the understanding that patient satisfaction is multifactorial underpins our use of HRQoL as a predictive outcome. The HRQoL utility score captures a broad range of health domains, including both physical and mental health. In comparison, traditional patient outcomes focus on symptoms restricted to the affected anatomy or pathology. 58 The disadvantage of the traditional approach is that it neglects the wider implications for health that knee arthritis can cause. Furthermore, in the context of health economics, the use of HRQoL utility scores can aid calculations in cost-utility and cost-benefit analyses. 59 Third, the use of probability deciles is unique to the SMART Choice tool. To our knowledge, probability deciles have not been used in previously published predictive models for arthroplasty, although the concept of deciles has been used successfully in previous economic evaluations of arthroplasty. 46 Without stratification of patients into deciles, the overall model showed a significant density overlap for actual outcomes when the probability scores were between 0.6 and 0.7 (Fig. 2). The negative implication of this would be a reduction in discrimination for patients at the upper or lower ends of the probability distribution. Based on the stratified model, patients in decile 10 who were predicted to improve had an 85% chance of this prediction being true. When compared with the non-stratified model, patients who were predicted to improve had only a 70% chance of this prediction being true (Table 6). As a result, the use of deciles allowed for a more individualized prediction of patient outcomes. Furthermore, the high calibration of the logistic regression model supported the use of deciles as a predictive outcome. 20 In comparison to most predictive tools that are designed for clinician use, SMART Choice reported slightly lower AUC values. 1 However, this should be interpreted in the context that the intention of this study was to develop a predictive model for patients to use early in the TKA journey. As a result, potential predictors of TKA outcome, including surgeon, implant, technical, and biological factors, were excluded. If a wider range of predictors could be included in this model, then we may expect to see an improvement in AUC performance. 60 Nonetheless, the calibration performance of the standard logistic regression model was still useful in a clinical context. Probability scores generated by this model could convey the likelihood of success after surgery without needing to explicitly predict whether a patient will improve or not improve. 61 The advantage of this approach is that patients can decide at what point the probability score is too risky for surgery. This allows the patient's risk appetite and healthcare values to be incorporated into the prediction model output when making decisions about surgery. 62 For example, a patient who perceives their symptoms as severe might be willing to undergo surgery with any probability of success greater than 60%, whereas a more risk-averse patient with milder symptoms may only be willing to undergo surgery if the probability of success is greater than 80%. Rather than using our tool to pre-set patients as 'high risk' or 'low risk' for surgery, the presentation of probability scores with SMART Choice defers this interpretation of risk to the patients.

Fig. 2. Probability distribution for the standard logistic regression model. The red area represents the probability distribution for patients whose actual outcome was no improvement after surgery. The blue area represents the probability distribution for patients whose actual outcome was improvement after surgery.
Despite machine learning methods performing worse than standard logistic regression in this study, machine learning algorithms still added analytical value to the model development process. In this study, classification trees were used to visually explore the relationship between predictors and outcomes. The classification tree model demonstrated that a VR12 pain interference score of 4 or higher was strongly correlated with outcome after surgery (Fig. 3). In addition, the VR12 responses for general health and pain interference were consistently used proximally in the classification tree to split predicted outcomes. These relationships are reflected in the variable importance calculations of the final logistic regression model (Fig. 4). Of interest, BMI and sex were less correlated with actual outcomes. However, the interpretability of predictor relationships in classification trees may translate only partially to logistic regression models. 63 When considering the best setting in which to implement the SMART Choice tool, there are several options. Ideally, this tool would be embedded in public health education websites about osteoarthritis. [64][65][66] This could then capture patients who are considering surgery but have not yet seen a clinician. Alternatively, the tool could be used by patients already on the journey towards TKA who have not yet seen a surgeon. This may include asking patients to use the tool before seeing a General Practitioner or before referral to an Orthopaedic Surgeon. We are conducting qualitative research to evaluate the optimal setting and timepoint for tool implementation.
There are several limitations to consider with our study. First, the model cohorts use prospectively collected institutional registry data. Patients who did not reach the 12-month follow-up due to death or other causes were excluded from the analysis, which may introduce selection bias to the study. In addition, due to the institutional nature of our registry, we are unable to extrapolate the findings of our study to a wider population. Second, some lifestyle data, such as alcohol consumption and smoking status, were self-reported. Previous studies have suggested that comorbidity and lifestyle data may be correlated with arthroplasty outcomes. 46,67 Therefore, any bias in the capture of lifestyle factors may influence the outcomes of predictive models that use these variables. Third, we note that the logistic regression model performance could have been improved if the data were subset. For example, a model limited to patients older than 65 years would have improved the AUC to 0.722. However, this would have resulted in 27% of the cohort being excluded from the model due to age, a proportion that realistically represents the share of TKA performed in patients under 65 years. 68 To maximize uptake of predictive model use, we decided to include a wider age range to allow maximum utility among a greater proportion of TKA patients. Fourth, despite our reasonable sample size of 3755 patients, this cohort is likely still too small to generate optimal predictive models, especially with machine learning methods. 69 Complex relationships are more likely to be found in larger datasets using machine learning methods. 70 In addition, at the time of developing this model, MCID values for the utility score in Australian populations were not available. Therefore, we used MCID values derived from a United Kingdom population instead. This may add additional uncertainty to our model.
Finally, to validate and evaluate this predictive model in a clinical setting, a pragmatic clinical trial should be conducted. 27 Without data from clinical trials, it would be difficult to attribute the statistical findings in this study to a real-world setting.
In conclusion, the standard logistic regression model performed the best with respect to discrimination and calibration. It was also well calibrated enough to generate accurate decile-based probability scores for improvement after surgery. Machine learning algorithms did not perform better than regression models in this study. Further evaluation using clinical trials is required to understand the effect of the SMART Choice tool on patient care and outcomes.