Assessing the Value of Imaging Data in Machine Learning Models to Predict Patient-Reported Outcome Measures in Knee Osteoarthritis Patients

Knee osteoarthritis (OA) affects over 650 million patients worldwide. Total knee replacement is aimed at end-stage OA to relieve symptoms of pain, stiffness and reduced mobility. However, the role of imaging modalities in monitoring symptomatic disease progression remains unclear. This study aimed to compare machine learning (ML) models, with and without imaging features, in predicting the two-year Western Ontario and McMaster Universities Arthritis Index (WOMAC) score for knee OA patients. We included 2408 patients from the Osteoarthritis Initiative (OAI) database and 629 patients from the Multicenter Osteoarthritis Study (MOST) database. The clinical dataset included 18 clinical features, while the imaging dataset contained an additional 10 imaging features. The Minimal Clinically Important Difference (MCID) was set at 24, reflecting meaningful physical impairment. Clinical and imaging dataset models produced similar area under the curve (AUC) scores, with differences in performance below 0.025. For both clinical and imaging datasets, Gradient Boosting Machine (GBM) models performed best in the external validation, with clinically acceptable AUCs of 0.734 (95% CI 0.687–0.781) and 0.747 (95% CI 0.701–0.792), respectively. The five most influential features identified were educational background, family history of osteoarthritis, co-morbidities, use of osteoporosis medications and previous knee procedures. This is the first study to demonstrate that ML models achieve comparable performance with and without imaging features.


Introduction
Osteoarthritis (OA) is the primary contributor to disability and chronic pain among patients over 60 years of age [1]. Knee OA is estimated to affect 654 million people worldwide, with pain and joint stiffness significantly impacting daily activities, quality of life and emotional well-being, particularly among the ageing population [2,3]. Patient-reported outcome measures (PROMs), such as the Oxford Knee Score (OKS) and the Western Ontario and McMaster Universities Arthritis Index (WOMAC), are widely utilised to evaluate symptoms in knee osteoarthritis [4]. WOMAC is a reliable multi-dimensional health status assessment tool that is considered the gold standard in evaluating OA severity and monitoring disease progression [5][6][7]. The added advantage of WOMAC over other PROMs is that it gives a more comprehensive evaluation of pain, stiffness and physical function [5,6]. It may also aid in decision-making around the timing of total knee arthroplasty (TKA) [8,9]. Recent studies have highlighted the need for reliable prediction of future PROMs to enhance the shared decision-making process with patients regarding their long-term treatment options [10].
Imaging modalities have demonstrated increased diagnostic sensitivity for symptomatic individuals with knee OA; nevertheless, their specific advantage in guiding clinical decision-making regarding management options remains uncertain [11,12]. Erlangga et al.'s systematic review found limited or inconclusive associations between magnetic resonance imaging (MRI) findings and OA-related knee pain across 22 studies [13]. However, moderate levels of evidence were found for imaging features such as bone marrow lesions and synovitis being an indication of knee pain in such patients [13]. Moreover, other studies reported certain MRI features identified in symptomatic patients that were also evident in asymptomatic individuals [11,14].
The increased use of imaging has led to a significant rise in healthcare expenditure around the world. Recent figures show the costs of unnecessary knee MRIs by a single consultant in the United Kingdom to be over GBP 13,000 per year [15]. Nationally, a study from Norway estimated that unnecessary knee MRIs cost around EUR 6.7–9.8 million every year [16]. Similarly, for plain radiographs, a study by Ashikyan et al. analysing the medical records of 500 knee OA patients in the United States found that over 15% of the six-month follow-up radiographs performed were deemed non-essential to patient care, amounting to USD 10,800 per year [17]. The financial impact and radiation risks add to the need for further research on the role of imaging in forecasting meaningful clinical changes in knee OA patients [18,19].
Whilst traditional statistical models are inherently constrained by their linearity assumptions and limited capacity to handle complex datasets, machine learning (ML) algorithms exhibit superior predictive power and feature engineering advantages [20][21][22]. In knee OA, previous ML studies have predominantly focused on predicting objective outcomes such as joint space narrowing and the need for TKA, overlooking other vital domains for these patients such as stiffness and functional ability [23,24]. Two studies have compared models based on clinical data alone with those based on MRI data in predicting the necessity for a TKA, concluding that incorporating MRI scans was of no additional value [25][26][27]. However, the role of imaging in predicting symptoms, which are the main drivers for patients' treatment choices, remains unexplored, as does the need for robust external validation of such models [28,29].
This study aimed to develop and externally validate ML models, and to comparatively evaluate their performance with and without imaging features, in forecasting the 2-year WOMAC score of patients with knee OA. Secondary objectives included identifying the most influential features that contribute to the predictive ability of the top-performing model. We hypothesised that machine learning models lacking imaging features would demonstrate comparable performance, as evaluated by the area under the curve (AUC) metric, to models incorporating imaging features in predicting the 2-year WOMAC scores.

Ethics Considerations
No ethical approval was required for this study owing to the open-access nature of the OAI and MOST databases. Ethical approval and informed consent for collecting data about participants were obtained by the OAI and MOST studies. The OAI dataset is hosted by the Osteoarthritis Initiative Data Coordinating Center (OAI DCC) at the University of California, San Francisco (UCSF), and is available through the National Institutes of Health (NIH) NIAMS repository: https://nda.nih.gov/oai (accessed on 5 March 2021). The MOST data are accessible through the MOST Online Data Repository and supported by the NIH NIAMS: https://most.ucsf.edu/ (accessed on 5 March 2021).

Data Source
This study used the Osteoarthritis Initiative (OAI) database (https://nda.nih.gov/oai/, accessed on 5 March 2021) to train and internally validate the ML models, and the Multicenter Osteoarthritis Study (MOST) database (https://most.ucsf.edu, accessed on 5 March 2021) to externally validate their performance. Both the OAI and MOST databases are multi-centre longitudinal prospective studies assessing men and women in the United States with, or at high risk of, symptomatic knee OA. OAI enrolled 4796 subjects aged between 45 and 79 years from February 2004 to May 2006, while MOST enrolled 3026 subjects aged between 50 and 79 years from April 2003 to April 2005. Both databases are publicly available, with the study design and protocol approved by the local institutional boards of participating centres and informed consent obtained from all participants [30,31]. Imaging features used from the OAI and MOST databases were read by trained assessors, blinded to patient details and clinical status. MRI scans were read using the Whole-Organ Magnetic Resonance Imaging Score (WORMS) scoring system.

Outcome Measure
Our primary outcome was the binary 2-year WOMAC change (improvement/no improvement). The WOMAC comprises three domains, covering knee pain, stiffness and functional limitations, as reported by the participants. The questionnaire consists of 24 items, each scored from 0 (none) to 4 (extreme) points, giving a total score between 0 and 96. A threshold score of 24 was selected based on the minimal clinically important difference (MCID) from previous studies that suggested meaningful symptomatic physical impairment in patients [32][33][34]. In other words, a total WOMAC score of 24 and above was categorised as clinically symptomatic (positive class), while a score below 24 was considered less significant (negative class).
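As a concrete illustration of this binarisation, the thresholding step can be sketched as follows (a minimal Python sketch for clarity; the study's analysis was performed in R, and the function name is our own):

```python
# Minimal sketch of the outcome binarisation described above.
# Illustrative only; the study's modelling was carried out in R.

MCID_THRESHOLD = 24  # minimal clinically important difference

def classify_womac(total_score: float) -> int:
    """Return 1 (positive class, clinically symptomatic) when the total
    2-year WOMAC score is 24 or above, otherwise 0 (negative class)."""
    if not 0 <= total_score <= 96:  # 24 items, each scored 0-4
        raise ValueError("total WOMAC score must lie between 0 and 96")
    return int(total_score >= MCID_THRESHOLD)
```

For example, a total score of 24 falls in the positive class, while 23 falls in the negative class.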

Feature Selection and Data Pre-Processing
In total, 1187 and 553 features were screened from the OAI and MOST databases, respectively. To enable external validation, only variables present in both the OAI and MOST databases with ≥60% completeness were included in this study.
In an attempt to generate explainable ML models, the features were systematically transformed into meaningful categorical comparisons in both databases. This process involved converting continuous variables, such as blood pressure, into discrete stages of hypertension, as per clinical guidelines [35]. Inter-related features such as pain medications were combined into a single variable based on the WHO analgesic ladder (no analgesia, non-steroidal anti-inflammatory drugs, narcotics) to prevent the dilution of features and improve model performance [36,37]. Features exhibiting high collinearity, such as the type of surgery performed, were excluded to mitigate redundancy. Table 1 shows the final set of features selected for model training, with additional information on the classification levels for each feature presented in Appendix A. Two separate datasets were created to evaluate the role of imaging features in predicting meaningful clinical changes in 2-year WOMAC scores. As shown in Table 1, the clinical dataset contained the selected clinical features but not the radiographic and MRI variables. The imaging dataset contained the features in the clinical dataset in addition to the radiographic and MRI variables. Subjects (n = 608) that had an MRI performed on both legs, or more than once at baseline, were recorded and analysed as separate observations.
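The ≥60% completeness filter on shared variables can be sketched as follows (an illustrative pure-Python sketch; the study itself was conducted in R, and the variable names and function name here are hypothetical):

```python
def completeness(column):
    """Fraction of non-missing (non-None) values in a column."""
    return sum(v is not None for v in column) / len(column)

def select_shared_complete_features(oai, most, min_completeness=0.60):
    """Keep only the variables present in both databases whose fraction
    of non-missing values meets the completeness threshold in each,
    mirroring the >=60% rule described above."""
    shared = oai.keys() & most.keys()
    return sorted(col for col in shared
                  if completeness(oai[col]) >= min_completeness
                  and completeness(most[col]) >= min_completeness)

# Toy example with hypothetical variables:
oai = {"age": [50, 60, None, 70, 80],
       "bmi": [None, None, None, None, None],
       "oai_only": [1, 1, 1, 1, 1]}
most = {"age": [55, None, 65, 75, 85],
        "bmi": [30, 31, 29, 33, 30]}
print(select_shared_complete_features(oai, most))  # ['age']
```

Here "bmi" is dropped because it falls below 60% completeness in one database, and "oai_only" is dropped because it is absent from the other.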
Patients with missing data in any of the final features were removed from this study. In the OAI database, both clinical and imaging datasets had their observations split into 80% training and 20% internal validation sets. The MOST database was later used to allow for an unbiased external validation. Figure 1 shows a summary of the pre-processing, training and testing stages.
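The 80/20 split can be sketched as follows (illustrative pure Python; the study used R, and the seed and function name are our own):

```python
import random

def train_validation_split(observations, train_fraction=0.80, seed=42):
    """Shuffle observations reproducibly and split them into training
    and internal validation sets, mirroring the 80/20 split above."""
    rng = random.Random(seed)
    shuffled = observations[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# With 2408 OAI observations, this reproduces the cohort sizes reported
# later in the Results section.
train, validation = train_validation_split(list(range(2408)))
print(len(train), len(validation))  # 1926 482
```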

Model Development, Training and Validation
Five linear and tree-based classification machine learning algorithms were developed, namely Least Absolute Shrinkage and Selection Operator (LASSO) regression, Ridge regression, Decision Tree (DT), Random Forest (RF) and Gradient Boosting Machine (GBM), to minimise variance and bias, and then compared to traditional multivariate Logistic Regression (LR) in predicting the binary 2-year WOMAC change (improvement/no improvement) [38][39][40][41][42]. This was followed by hyperparameter tuning via 10-fold cross-validation in the LASSO, Ridge, RF and GBM models to reduce over-fitting to the training set [43]. No tuning was required for the base models using linear (LR) and tree-based (DT) algorithms. The models were then tested on the previously unused MOST database for external validation. Model performance was assessed in terms of the gold-standard area under the receiver operating characteristic curve (AUC) [44]. To account for the class imbalance between positive and negative cases in the outcome measure, the F1 score was computed [45]. Given that the F1 score depends on the decision threshold used to convert outcome probabilities into discrete classes, the thresholds were optimised to maximise the score and achieve the highest positive predictive ability.
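The threshold optimisation step can be illustrated with a simple grid search (a minimal Python sketch of the general technique; the study's implementation was in R, and the function names and candidate grid are our own):

```python
def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Scan candidate decision thresholds and return the (threshold, F1)
    pair that maximises the F1 score, mirroring the optimisation step
    described above."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(5, 100, 5)]
    best = max(((f1_score(y_true, [int(p >= t) for p in y_prob]), t)
                for t in thresholds), key=lambda pair: pair[0])
    return best[1], best[0]
```

For example, `best_f1_threshold([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1])` finds a threshold that perfectly separates the two classes (F1 = 1.0).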

Statistical Analysis and Feature Importance
Descriptive statistics, including means and percentages, were used to describe the rates of change in the 2-year WOMAC score across the OAI and MOST databases. All mathematical modelling was carried out using the R statistical computing environment, version 4.3.0 (R: A language and environment for statistical computing). The R packages 'survival' (version 3.6-4), 'gbm' (version 2.1.8), 'glmnet' (version 4.1-4), 'tree' (version 1.0-43), 'rpart' (version 4.1.16) and 'randomForest' (version 4.7-1.1) were used for the analyses. Further details of the software packages are provided in Appendix A. A two-tailed Wilcoxon signed-rank test was used to assess the difference in the non-parametric AUC scores of the imaging and clinical dataset models. However, caution is advised when interpreting the results, since our remaining findings were not subject to inferential testing and do not establish statistical significance.
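As an illustration of this paired comparison, the two-sided exact Wilcoxon signed-rank p-value can be computed by enumerating the null distribution (a Python sketch for small paired samples without tied magnitudes; the study used R, and the paired AUC values below are hypothetical, not the study's results):

```python
from itertools import product

def wilcoxon_signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank p-value for small paired
    samples, assuming no tied or zero differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the differences by absolute magnitude (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    w = sum(rank[i] for i in range(n) if diffs[i] > 0)
    # Null distribution: each rank is positive or negative with equal
    # probability, so enumerate all 2^n sign assignments.
    totals = [sum(r for r, s in zip(range(1, n + 1), signs) if s)
              for signs in product([0, 1], repeat=n)]
    ge = sum(t >= w for t in totals) / len(totals)
    le = sum(t <= w for t in totals) / len(totals)
    return min(1.0, 2 * min(ge, le))

# Hypothetical paired AUC scores for five models (illustrative values):
clinical = [0.734, 0.729, 0.731, 0.726, 0.728]
imaging = [0.747, 0.735, 0.733, 0.730, 0.740]
print(wilcoxon_signed_rank_exact(clinical, imaging))  # 0.0625
```

Note that with only five paired models, even a consistent one-directional difference cannot reach p < 0.05 under the exact two-sided test, which is one reason for the cautious interpretation above.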

Data Distribution
The final dataset included 2408 and 629 observations from the OAI and MOST databases, respectively. Descriptive statistics related to patient demographics, clinical data and imaging features for the OAI and MOST databases are presented in Table 2. As highlighted in Table 2, the majority of patients were from White or Caucasian ethnic backgrounds, amounting to approximately 84.3–89.5% of observations. Variances between the OAI and MOST databases were notable in analgesic medication, Kellgren-Lawrence (KL) grade and patellofemoral joint cartilage morphology. Nevertheless, a comparable proportion of patients exhibited normal WOMAC baseline scores in both databases (OAI: 73.7%, MOST: 73.1%).

Model Performance
The OAI database was split into 80% (n = 1926) training and 20% (n = 482) internal validation cohorts, whilst 100% of the MOST database (n = 629) was used for external validation. Table 3 shows the AUC values for the six models in the training and internal validation sets. ML models in the datasets from OAI had AUC score ranges of 0.628–0.820 in the training and 0.630–0.786 in the internal validation sets. These scores consistently surpassed the clinically acceptable threshold of AUC > 0.70, with the exception of the Decision Tree (DT) algorithm (AUC = 0.62–0.66). Across both clinical and imaging datasets, the Random Forest (RF) algorithm was the best-performing model in both the training and internal test sets (AUC = 0.77 and AUC = 0.78, respectively). There was comparable model performance, in terms of AUC values, between the clinical and imaging datasets, with a marginal difference of 0.025.
Figure 2 highlights the receiver operating characteristic (ROC) curves for the six models in the external validation MOST cohort. GBM models achieved the highest AUC in the external test for both the clinical (AUC = 0.734) and imaging (AUC = 0.747) datasets.
Due to its low performance (AUC < 0.70) in both internal and external tests, the DT algorithm was excluded from further analysis. Across all other models, variations in AUC scores between clinical and imaging datasets were minimal (<0.02), with no statistical differences (p > 0.05). The ROC curves for the training and internal validation sets are provided in Appendix B. Excluding the DT model, the F1 score, a relative measure of a model's ability to identify true positive classes, exceeded 0.5 for all models in both clinical and imaging datasets (Table 4). Random Forest (RF) and Gradient Boosting Machine (GBM) achieved the highest F1 scores in the internal (F1 = 0.617) and external (F1 = 0.548) tests, respectively. Similar to the AUC values, there was no significant variation in F1 scores between clinical and imaging datasets (<0.03). Further information on the change in AUC and F1 scores between internal and external tests is provided in Appendix B.

Feature Importance
The GBM model was the best-performing model across both clinical and imaging datasets, as evaluated by the AUC score. The top five most influential factors affecting the predictive ability of the GBM models are given in Table 5. The patients' educational background exerted the most substantial influence on the predictive ability of the GBM model in the clinical dataset, accounting for 22% of feature importance, and ranked second with approximately 8% in the imaging dataset (Table 5). In the clinical dataset, this was followed by factors such as osteoarthritis history, comorbidities, medicated osteoporosis and a previous knee operation. Conversely, in the imaging dataset, the most influential feature was the KL grade (~10%). Other contributing features in the imaging model included the 20 m walk test, joint space narrowing (JSN) in the medial compartment, and the use of analgesics.

Discussion
This study aimed to evaluate whether machine learning models can attain comparable performance in predicting a binary change in the 2-year WOMAC score in patients with knee OA irrespective of the inclusion of MRI and radiographic features, and to compare these models with traditional logistic regression, using the OAI and MOST databases. Our findings highlighted comparable predictive capabilities, with minimal differences of less than 0.025 in area under the curve (AUC) values, whether or not the models incorporated imaging features. The GBM algorithm demonstrated the highest AUC and F1 scores at external validation, achieving similarly acceptable scores for the clinical and imaging dataset models and outperforming the logistic regression model.
This study adopted an ML approach and identified that the best-performing models were tree-based algorithms (RF, GBM). Previously, Bastick et al. utilised LR algorithms to detect pain trajectories and predict patient symptoms from their clinical data [46]. However, our findings align with previous studies underscoring the superior predictive performance of tree-based algorithms in capturing non-linear relationships when compared to traditional statistical methods [47,48]. Importantly, the GBM and RF models consistently achieved the highest AUC for both clinical and imaging datasets, demonstrating clinically acceptable scores upon external validation, a facet not extensively addressed in prior research [49].
In terms of feature importance, the GBM algorithm in our study identified educational background as the most important predictive driver in both clinical and imaging datasets. A previous cross-sectional study analysed the relationship between clinical features and knee OA and reported educational background to have the strongest significant negative correlation with a patient's current WOMAC score [50]. Their findings highlighted that the education status of patients with knee OA was likely to impact their future functional and QoL outcomes [50]. This may be explained by hidden confounding factors within educational background, such as income and type of employment, that affect knee OA progression [51]. Additionally, low education levels may contribute to a lack of knowledge and awareness regarding lifestyle modifications for managing OA [52]. However, in our study, it is important to interpret this finding with caution due to the lack of inference testing, the inherent limitations of the OAI and MOST databases, and ML algorithmic bias [53][54][55]. Future prospective studies are needed to evaluate this causal inference.
In the imaging dataset, the Kellgren-Lawrence (KL) grade from radiographs exerted the most substantial influence on the GBM model, while no MRI features were identified as highly influential in enhancing the predictive ability of the machine learning model. A previous study that used the MOST database included 696 observations and showed a higher occurrence of symptomatic knee pain in KL grades 1-4 (odds ratios: 1.5, 3.9, 9.0 and 151, respectively) as compared to KL grade 0 [56]. Whilst our study favoured explainable machine learning models over the black-box approach posed by deep learning (DL), another study using 9348 observations from the OAI database showed that radiographs analysed through DL alone (AUC = 0.770) were able to predict the symptomatic progression of knee OA better (p < 0.001) than the clinical features (AUC = 0.692) [57]. While it has been shown that MRI features are associated with patients experiencing knee pain, individual features were ineffective in discriminating between painful and non-painful knees [58]. This suggests that more studies are required to evaluate DL approaches in analysing MRI features for symptom prediction.
Interestingly, Ashinsky et al. used only MRI features from the OAI database to predict a 3-year change in WOMAC score, achieving a 75% classification accuracy [59]. Their findings reported cartilage thickness at the central portion of the medial femoral condyle to be the feature that most affected symptomatic OA progression [59]. Whilst cartilage thickness was not shown to be a key feature in our study, this is likely due to the greater influence of clinical factors in predicting the 2-year WOMAC. Schiratti et al., who combined MRI data with clinical features from the OAI database to predict patients' 1-year WOMAC pain scores, achieved a lower AUC score (0.724) than the GBM in our study (AUC = 0.747) [60]. This may be because they utilised raw MRI images, which can add unnecessary noise to the dataset and prove counterproductive in improving model performance [61]. Their study also suggested that the intra-articular space has the highest contribution in predicting a patient's pain, a feature that was not recorded in the datasets of our study [60,62]. Therefore, this may be a significant feature to obtain in future studies to boost the model's predictive ability.
This study employed two of the largest available osteoarthritis databases (OAI and MOST) to test and externally validate our ML models. Moreover, the input features curated are commonly assessed and recorded for all OA patients, increasing the explainability and real-world applicability of our models. However, there are limitations to report.
Firstly, the databases drew subjects exclusively from the United States, resulting in limited diversity in ethnic background, educational status and socioeconomic factors. Consequently, the generalisability of our models to international contexts with diverse backgrounds might be constrained. Additionally, while the images underwent evaluation by individuals following a standardised scoring system, the inherent subjectivity of this process, as documented in previous research, is a potential limitation [63]. A comparative analysis with DL models utilising raw images and continuous data could offer an alternative perspective on their role in predicting outcome scores. Lastly, the models were developed using an imbalanced class dataset, with a minority of observations (26.5%) having a high WOMAC score. This presents a challenge for machine learning algorithms and potentially diminishes their performance. Future research could address this issue through under-sampling methods, such as the K-nearest neighbour algorithm, to rebalance the data [64].
This study underscores the effectiveness of machine learning (ML) models in predicting knee osteoarthritis (OA) severity, as measured by changes in the 2-year WOMAC score, using routinely recorded clinical data without the need for additional imaging. The practical application of predicting WOMAC scores holds promise for clinicians to evaluate the progression of health-related quality of life in knee OA from an early stage, enhancing the shared decision-making process and tailoring patient management strategies. Early, smaller interventions, such as focal resurfacing or unicompartmental knee arthroplasty (UKA), for patients at high risk of developing severe symptoms could potentially enhance their long-term QoL [9].
Future studies are needed to evaluate ML's ability to forecast even longer-term WOMAC scores. While our study focused on symptomatic patients with existing knee OA, the evaluation of WOMAC scores in asymptomatic patients might be helpful in aiding the early decision-making process. Finally, the use of ML in predicting other PROMs that assess QoL, such as the SF-36, would enable clinicians to adopt a more holistic approach to patient care in the future.

Conclusions
This study demonstrated that machine learning (ML) models leveraging only clinical data are comparably effective to models incorporating additional imaging features in predicting the 2-year WOMAC score of knee osteoarthritis patients. Gradient Boosting Machine algorithms emerged as the top-performing ML models for this outcome during external validation, achieving clinically acceptable predictive AUC scores. In the clinical context, this suggests that patient prognosis can be successfully estimated using routinely collected patient data only, providing an opportunity to enhance patient assessment, facilitate timely interventions and avoid unnecessary imaging costs.

Table A3.
Change in Area Under Curve (AUC) and F1 scores of machine learning models at internal and external validation, when adding imaging features to the dataset.


Figure 1.
Figure 1. Flowchart summarising the methodology from data extraction to model training and testing for Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) databases.


Figure 2.
Figure 2. Receiver operating characteristic (ROC) curves showing Area Under Curve (AUC) scores (with 95% confidence intervals) of all six (a) clinical and (b) imaging machine learning algorithms at external validation from the Multicenter Osteoarthritis Study (MOST). The thin black line represents the performance of a random classifier (AUC = 0.500). All values are shown to three significant figures. LR, Logistic Regression; DT, Decision Tree; RF, Random Forest; GBM, Gradient Boosting Machine.


Figure A2.
Figure A2. Receiver operating characteristic (ROC) curves showing Area Under Curve (AUC) scores (with 95% confidence intervals) of all six (a) clinical and (b) imaging machine learning algorithms in the internal test set from the Osteoarthritis Initiative. The thin black line represents the performance of a random classifier (AUC = 0.500). All values are shown to three significant figures. LR, Logistic Regression; DT, Decision Tree; RF, Random Forest; GBM, Gradient Boosting Machine.


Table 1.
Final summary of the list of features used to train the machine learning algorithms.

Table 2.
List of the baseline features and their most populated subgroup, with the total number (N) and percentage (%) of observations recorded at that level in Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) databases.

Table 3.
Area Under Curve (AUC) scores (with 95% Confidence Intervals) of six machine learning algorithms that underwent training and internal tests for clinical and imaging datasets.

Table 4.
F1 scores of six machine learning algorithms that underwent internal and external tests for clinical and imaging datasets.


Table 5.
Top five most influential features in the best performing Gradient Boosting Machine (GBM) model for clinical and imaging datasets.