Developing Machine Learning Algorithms to Support Patient-centered, Value-based Carpal Tunnel Decompression Surgery

Background: Carpal tunnel syndrome (CTS) is extremely common and typically treated with carpal tunnel decompression (CTD). Although generally an effective treatment, up to 25% of patients do not experience meaningful benefit. Given its prevalence, this amounts to considerable morbidity and cost without return. Being able to reliably predict preoperatively which patients would benefit from CTD would support more patient-centered and value-based care. Methods: We used registry data from 1916 consecutive patients undergoing CTD for CTS at a regional hand center between 2010 and 2019. Improvement was defined as change exceeding the respective QuickDASH subscale's minimal important change estimate. Predictors included a range of clinical, demographic and patient-reported variables. Data were split into training (75%) and test (25%) sets. A range of machine learning algorithms was developed using the training data and evaluated with the test data. We also used a machine learning technique called chi-squared automatic interaction detection to develop flowcharts that could help clinicians and patients to understand the chances of a patient improving with surgery. Results: The top-performing models predicted functional and symptomatic improvement with accuracies of 0.718 (95% confidence interval 0.660, 0.771) and 0.759 (95% confidence interval 0.708, 0.810), respectively. The chi-squared automatic interaction detection flowcharts could provide valuable clinical insights from as few as two preoperative questions. Conclusions: Patient-reported outcome measures and machine learning can support patient-centered and value-based healthcare. Our algorithms can be used for expectation management and to rationalize treatment risks and costs associated with CTD.


Missing Data
To develop machine learning algorithms that predicted symptomatic improvement following CTD, we included 1093/1916 patients who had complete response sets to QuickDASH items 9-11 preoperatively and postoperatively. Of the 823 patients with incomplete response sets, 792 were missing postoperative item responses.
To develop machine learning algorithms that predicted functional improvement following CTD, we included 1045/1916 patients who had complete response sets to QuickDASH items 1-6 preoperatively and postoperatively. Of the 871 patients with incomplete response sets, 839 were missing postoperative item responses.
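The improvement labels can be illustrated with a short sketch. The study's analysis scripts are written in R; the Python sketch below is purely illustrative, and the MIC threshold shown is a placeholder value, not the study's estimate. QuickDASH subscale scores are computed as ((mean item response - 1) x 25), giving a 0-100 scale where higher scores indicate worse disability.

```python
def quickdash_score(responses):
    """Convert QuickDASH item responses (each 1-5) to a 0-100 scale:
    ((mean response - 1) * 25); higher = worse disability."""
    return (sum(responses) / len(responses) - 1) * 25

def improved(pre, post, mic):
    """Label a patient 'improved' when the pre-to-post score change
    exceeds the subscale's minimal important change (MIC) estimate."""
    return (quickdash_score(pre) - quickdash_score(post)) > mic

# Symptom subscale = QuickDASH items 9-11 (pain, paresthesia, night pain).
# The MIC value below is illustrative, not the study's estimate.
print(improved(pre=[4, 5, 4], post=[2, 2, 1], mic=15))  # prints True
```

Patients lacking any pre- or postoperative item in the relevant subscale cannot be labeled this way, which is why the analysis was restricted to complete response sets.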
Predictors of missing postoperative responses were identified using the finalfit package (see our R script entitled "02 Missing data analysis.R").

Features
(name missing): The number of cigarettes the patient smokes a day (continuous integer)
CigaretteYears: The number of years for which the patient has smoked (continuous integer)
heart: Self-reported current heart disease, treated or untreated (categorical: no or yes)
hbp: Self-reported current high blood pressure, treated or untreated (categorical: no or yes)
lung: Self-reported current lung disease, treated or untreated (categorical: no or yes)
diabetes: Self-reported current diabetes, treated or untreated (categorical: no or yes)
ulcer: Self-reported current stomach ulcers, treated or untreated (categorical: no or yes)
kidney: Self-reported current kidney disease, treated or untreated (categorical: no or yes)
liver: Self-reported current liver disease, treated or untreated (categorical: no or yes)
anaemia: Self-reported current anaemia, treated or untreated (categorical: no or yes)
cancer: Self-reported current cancer, treated or untreated (categorical: no or yes)
depression: Self-reported current depression, treated or untreated (categorical: no or yes)
osteoarthritis: Self-reported current osteoarthritis, treated or untreated (categorical: no or yes)
backpain: Self-reported current back pain, treated or untreated (categorical: no or yes)
rheumatoid: Self-reported current rheumatoid arthritis, treated or untreated (categorical: no or yes)
Thyroid: Self-reported current thyroid disease, treated or untreated (categorical: no or yes)
EQ5D_Mobility: Response to the EQ-5D-5L Mobility Dimension (continuous integer: 1-5)
EQ5D_Selfcare: Response to the EQ-5D-5L Self Care Dimension (continuous integer: 1-5)
EQ5D_Usual_Activities: Response to the EQ-5D-5L Usual Activities Dimension (continuous integer: 1-5)
(name missing): Response to the EQ-5D-5L Pain/Discomfort Dimension (continuous integer: 1-5)
EQ5D_Anxiety: Response to the EQ-5D-5L Anxiety/Depression Dimension (continuous integer: 1-5)
EQ5D_VAS: Response to the EQ-5D-5L visual analogue scale (continuous integer: 1-100)
DASH1: Response to QuickDASH item 1, relating to difficulty in opening a tight jar (continuous integer: 1-5)
DASH2: Response to QuickDASH item 2, relating to difficulty doing household tasks (continuous integer: 1-5)
DASH3: Response to QuickDASH item 3, relating to difficulty carrying shopping (continuous integer: 1-5)
DASH4: Response to QuickDASH item 4, relating to difficulty washing one's back (continuous integer: 1-5)
DASH5: Response to QuickDASH item 5, relating to difficulty in using a knife (continuous integer: 1-5)
DASH6: Response to QuickDASH item 6, relating to difficulty in forceful recreational activities (continuous integer: 1-5)
DASH7: Response to QuickDASH item 7, relating to interference with social activities (continuous integer: 1-5)
DASH8: Response to QuickDASH item 8, relating to interference with work activities (continuous integer: 1-5)
DASH9: Response to QuickDASH item 9, relating to pain severity (continuous integer: 1-5)
DASH10: Response to QuickDASH item 10, relating to paresthesia severity (continuous integer: 1-5)
DASH11: Response to QuickDASH item 11, relating to night pain (continuous integer: 1-5)

Preprocessing
For both the symptomatic improvement and functional improvement classifiers, training dataset class balance (improved vs not improved) was checked prior to bootstrapping and hyperparameter tuning. Class imbalance in the symptomatic improvement classifier's training data was addressed with the synthetic minority oversampling technique (SMOTE). 1 Dummy variables were created for categorical predictors via level encoding. Continuous predictors were scaled by 2 standard deviations. Missing data were imputed via mode imputation for categorical predictors and mean imputation for continuous predictors. Predictors with near-zero variance were removed (see scripts "04 Symptom classification.R" and "05 Function classification.R").
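As a minimal sketch of two of these preprocessing steps (Python for illustration only; the study's pipeline is implemented in R), the following implements interpolation-based SMOTE oversampling and scaling by two standard deviations. The neighbour count and seed are illustrative choices, not the study's settings.

```python
import random
from statistics import mean, stdev

def scale_2sd(x):
    """Scale a continuous predictor by two standard deviations, which
    puts it on a scale roughly comparable to a binary dummy variable."""
    m, s = mean(x), stdev(x)
    return [(v - m) / (2 * s) for v in x]

def smote(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic point is a random
    interpolation between a minority-class sample and one of its
    k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((ai - pi) ** 2 for ai, pi in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic
```

In the symptom classifier's training data, this kind of oversampling is what lifted the 208 non-improvers toward parity with the improvers (see Class Balance below).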

Hyperparameter Tuning
Model hyperparameters were tuned using grid search implemented through the tidymodels framework. 2 Regular grids with 10 levels per tuned hyperparameter were used for this process, except for the extreme gradient boosted decision tree ensemble (XGB), which has a large parameter space. In this case, Latin hypercube sampling was used to generate the hyperparameter grid. 3 For the logistic regression with elastic net regularization, penalty and mixture hyperparameters were tuned, relating to the shrinkage operator and the proportion of L1:L2 regularization respectively.
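The idea behind a Latin hypercube grid can be sketched in a few lines (Python for illustration; in practice this is handled by the tidymodels tooling): each hyperparameter's range is divided into n equal strata, one value is drawn per stratum, and each dimension is shuffled independently, so every dimension is covered evenly without enumerating a full regular grid.

```python
import random

def latin_hypercube(n, bounds, seed=0):
    """Minimal Latin hypercube sketch: for each hyperparameter
    (given as a (lo, hi) range), draw one value from each of n
    equal strata, then shuffle the dimension so rows pair values
    across dimensions at random."""
    rng = random.Random(seed)
    columns = []
    for lo, hi in bounds:
        strata = [lo + (hi - lo) * (i + rng.random()) / n for i in range(n)]
        rng.shuffle(strata)
        columns.append(strata)
    return list(zip(*columns))  # one candidate configuration per row
```

For a large space such as the XGB's (tree depth, learning rate, and so on), this yields n candidate configurations instead of the exponentially many points of a 10-level regular grid.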
For the XGB, the number of trees was set to 1000, while we tuned: tree depth, the minimum number of observations per node required to undertake a split, the loss reduction required to undertake a split, the sample size, the number of predictors sampled per split, and the learning rate.
For the support vector machine (SVM), the cost penalty and polynomial degree were tuned. For the artificial neural network (ANN), the number of hidden layers and the dropout rate were tuned over 10 epochs. For the K-nearest neighbors (KNN) algorithm, the number of neighbors and the choice of kernel function were tuned.
Optimal model hyperparameters were selected from the bootstrapped samples based on mean classification accuracy. Final model parameters were then obtained by fitting the models to the whole training dataset.
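The selection step can be sketched as follows (Python for illustration only): each candidate configuration is scored by its mean out-of-bag accuracy over bootstrap resamples, the best is kept, and that configuration is refit on the whole training set. The toy threshold "model" is a stand-in for the real learners, not any model used in the study.

```python
import random
from statistics import mean

def bootstrap_accuracy(train, fit, predict, params, n_boot=25, seed=0):
    """Mean out-of-bag classification accuracy over bootstrap resamples,
    mirroring how each candidate configuration is scored during tuning.
    A fixed seed gives every configuration the same resamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(train)) for _ in range(len(train))]
        in_bag = [train[i] for i in idx]
        oob = [row for i, row in enumerate(train) if i not in set(idx)]
        if not oob:
            continue
        model = fit(in_bag, params)
        scores.append(mean(predict(model, x) == y for x, y in oob))
    return mean(scores)

def select_best(grid, train, fit, predict):
    """Pick the configuration with the highest mean bootstrap accuracy,
    then refit it on the whole training dataset (the final step above)."""
    best = max(grid, key=lambda p: bootstrap_accuracy(train, fit, predict, p))
    return best, fit(train, best)

# Toy stand-in for the real models: predict "improved" when a single
# score exceeds a tunable cut point (fit simply returns the cut).
fit = lambda data, params: params["cut"]
predict = lambda cut, x: x >= cut
train = [(i / 10, i / 10 >= 0.5) for i in range(10)]
grid = [{"cut": c} for c in (0.2, 0.5, 0.8)]
best, _ = select_best(grid, train, fit, predict)
```

With this separable toy data, the cut at 0.5 classifies every out-of-bag point correctly, so it is the configuration selected.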

Heuristic Models
We chose QuickDASH item 9, QuickDASH item 11, and the EQ-5D-5L VAS as variables for the symptomatic improvement chi-squared automatic interaction detection (CHAID) classifier. We chose QuickDASH item 1 and the EQ-5D-5L Mobility domain as variables for the functional improvement CHAID classifier. These variables were chosen based on Shapley values from the respective XGB algorithms, perceived ease of implementation, and clinical plausibility.
For each tree, the level of significance for splitting nodes and for merging predictor categories was set to p < 0.05. A minimum of 20 observations per node was required to undertake a split.
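A simplified version of the CHAID splitting rule can be sketched as follows (Python for illustration only). Full CHAID also merges predictor categories and applies a Bonferroni adjustment; this sketch only searches binary cut points on a single ordinal predictor, using the chi-squared test at p < 0.05 and the 20-observation minimum described above.

```python
from math import erfc, sqrt

def chi2_p_2x2(a, b, c, d):
    """p-value for a 2x2 contingency table [[a, b], [c, d]] from the
    chi-squared statistic (df = 1, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(stat / 2))  # survival function of chi-squared, df = 1

def chaid_split(rows, min_node=20, alpha=0.05):
    """CHAID-style binary split sketch: rows are (response, improved)
    pairs with an ordinal response. Try each cut point, keep the most
    significant chi-squared split, and refuse to split when p >= alpha
    or a child node would hold fewer than min_node observations."""
    best = None
    for cut in sorted({r for r, _ in rows})[1:]:
        left = [imp for r, imp in rows if r < cut]
        right = [imp for r, imp in rows if r >= cut]
        if len(left) < min_node or len(right) < min_node:
            continue
        p = chi2_p_2x2(sum(left), len(left) - sum(left),
                       sum(right), len(right) - sum(right))
        if p < alpha and (best is None or p < best[1]):
            best = (cut, p)
    return best  # (cut point, p-value), or None when no valid split
```

Applied recursively, splits of this kind produce the flowcharts described in the main text, where one or two questionnaire responses route a patient to an estimated chance of improvement.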

Missing Data
A missing data analysis (Supplementary Table 2 and Supplementary Table 3) suggested missing data were largely missing at random. 4 Compared to those with complete postoperative response sets, patients with missing follow-up responses were younger, had experienced symptoms for a longer time and had a higher BMI. Nonresponders were more likely to be smokers and report depression. Responders were more likely to be currently employed, report high blood pressure and report osteoarthritis.

Class Balance
Prior to using the SMOTE, the symptom classifier training dataset contained 208 patients that did not improve and 611 that did improve. Following the SMOTE, the symptom classifier training dataset contained 624 patients in each class. In the function classifier training dataset, there were 398 patients that did not improve and 385 that did improve. We did not conduct the SMOTE in the function classifier training dataset.

Model Performance
Confusion matrices for each model are presented in Supplementary