Machine Learning Prediction of Treatment Response to Biological Disease-Modifying Antirheumatic Drugs in Rheumatoid Arthritis

Background: Disease-modifying antirheumatic drugs (bDMARDs) have shown efficacy in treating Rheumatoid Arthritis (RA). Predicting treatment outcomes for RA is crucial as approximately 30% of patients do not respond to bDMARDs and only half achieve a sustained response. This study aims to leverage machine learning to predict both initial response at 6 months and sustained response at 12 months using baseline clinical data. Methods: Baseline clinical data were collected from 154 RA patients treated at the University Hospital in Erlangen, Germany. Five machine learning models were compared: Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), K-nearest neighbors (KNN), Support Vector Machines (SVM), and Random Forest. Nested cross-validation was employed to ensure robustness and avoid overfitting, integrating hyperparameter tuning within its process. Results: XGBoost achieved the highest accuracy for predicting initial response (AUC-ROC of 0.91), while AdaBoost was the most effective for sustained response (AUC-ROC of 0.84). Key predictors included the Disease Activity Score-28 using erythrocyte sedimentation rate (DAS28-ESR), with higher scores at baseline associated with lower response chances at 6 and 12 months. Shapley additive explanations (SHAP) identified the most important baseline features and visualized their directional effects on treatment response and sustained response. Conclusions: These findings can enhance RA treatment plans and support clinical decision-making, ultimately improving patient outcomes by predicting response before starting medication.


Introduction
Rheumatoid Arthritis (RA) is a common inflammatory condition that primarily affects the small joints of the hands and feet, leading to disability, discomfort and deformity.It affects approximately 0.5-1% of the global population [1][2][3].
Biological disease-modifying antirheumatic drugs (bDMARDs) are an effective treatment for RA, typically prescribed when patients do not adequately respond to conventional synthetic DMARDs (csDMARDs) [4].According to the European Alliance of Associations for Rheumatology (EULAR) recommendations, regular monitoring every 1-3 months is essential in managing active RA.Therapy should be adjusted if a response is not achieved within six months [5].The goal of RA treatment is a sustained response, defined as remaining in remission or at least low disease activity (LDA).Patients on bDMARDs should have at least two follow-up visits within six months after achieving a response to ensure sustained response [6].Despite their effectiveness, 30-40% of patients do not respond to bDMARDs, and only 50% achieve a sustained response [7,8].Early response prediction before starting medication and appropriate treatment selection can improve disease control, reduce joint damage and alleviate pain [9,10].Non-responders may experience uncontrolled disease progression, leading to increased healthcare costs and a decline in quality of life [11][12][13][14].Although the costs of bDMARDs have decreased, their substantial cost still necessitates judicious use of healthcare resources [15].
Machine learning (ML) can identify patterns and relationships within data, offering potential benefits for predicting bDMARD outcomes and helping rheumatologists make more accurate treatment decisions [16].While previous studies have explored ML techniques for predicting treatment response in RA patients, they face several challenges, such as limitations associated with the availability and cost of imaging and gene expression data [17].For example, San Koo et al. [18] and Lee et al. [19] used clinical and imaging data to predict response at one year, but imaging data are often not available in routine practice and one year is too long to wait for therapy adjustment.Guan et al. [20] used clinical and genetic markers to predict response over 24 months, but genetic data are typically not available in routine clinical practice.Tao et al. [21] focused on predicting response at six months using genetic data, which is also often inaccessible.In addition, Rivellese et al. [22] and Yoosuf et al. [23] used gene expression to predict treatment response.However, such datasets are not available for every patient and are expensive for healthcare systems.To our knowledge, none of this research has predicted the sustained response at twelve months.
This study aims to overcome existing limitations by using routine clinical data before starting medication, to predict both the initial response and the sustained response.Moreover, we identify the most relevant baseline clinical features and their directional effects on treatment outcomes using Shapley additive explanations (SHAP), which enhances the model's interpretability for physicians.Our approach not only aligns with the EULAR gold standard for six-month response but also focuses on the sustained response, a critical yet often neglected aspect of RA treatment.We assess our strategy using clinical data from RA patients treated with bDMARDs, incorporating baseline and follow-up information from the initiation of treatment until any change or discontinuation of therapy.

Data Collection
In this study, we collected anonymized data from RA patients at Erlangen University Hospital in Germany.All patients met the ACR/EULAR 2010 classification criteria for RA [24].The research conducted in this study complied entirely with the principles outlined in the Declaration of Helsinki.The ethics committee of Friedrich-Alexander University (FAU) approved conducting the research in a cohort of patients with RA, with approval reference numbers 334-18 B and 333-16 B.
These patients were included from the time they initiated the bDMARDs treatment until they either changed the treatment or tapered these medications.For each patient, the study gathered clinical data for a baseline established at the time the patient started taking bDMARDs.Subsequent data were collected during patient follow-ups.All gathered clinical characteristics followed the same healthcare protocols and guidelines throughout the entirety of the study period.Demographic characteristics like age and gender, as well as disease-specific characteristics such as the type of medications the patients were taking in addition to their bDMARDs, like csDMARDs and non-steroidal anti-inflammatory drugs (NSAIDs), were recorded.Disease-specific characteristics also included C-reactive protein (CRP) level, erythrocyte sedimentation rate (ESR), rheumatoid factor (RF) and presence of cyclic citrullinated peptide (CCP) antibodies.Furthermore, disease activity measures, such as tender and swollen joint counts based on 28 joints (TJC28 and SJC28), visual analogue scales (VAS) for pain, patients and physicians' global disease activity, disease activity score in 28 joints based on CRP and ESR (DAS28-ESR and DAS28-CRP), clinical disease activity index (CDAI), simple disease activity Index (SDAI) and health assessment questionnaire (HAQ) were assessed.The study also took into account other comorbidities, for instance, asthma, diabetes, heart disease, etc.The list of all features can be seen in Tables A2 and A3.

Data Preprocessing
Before analyzing the collected data, a preprocessing step was necessary to address inconsistencies due to the routine data-collection process.It is important to note that rheumatologists collected data during their own diagnoses and treatments.However, some challenges emerged, such as the selective collection of features during follow-up visits.In some cases, specific features were recorded only once (rather than during every follow-up), like, for example, gender or certain features might have been omitted intentionally or accidentally, leading to various missing values in the raw data.To overcome this issue, we implemented a comprehensive imputation strategy.The proportion of missing data for each value is summarized in Table A1.
For the imputation of missing values, we employed different methods depending on the variable characteristics.For variables demonstrating linear correlations with other follow-up measurements, such as patient comorbidities, we utilized straightforward linear imputation techniques like the nearest available observation (NAO) and linear extrapolation [25].These imputations were conducted on an individual patient basis and remained independent of data from other patients.
For other missing values, such as DAS28 and the values used to calculate DAS28 (e.g., CDI Score, SDI Score SJC28 and TJC28), which did not show linear correlations with other variables, we used the Multiple Imputation by Chained Equations (MICE) method [26].MICE uses data from various variables to estimate the best possible prediction for each missing value by considering data from all patients collectively, not just individually.
We chose MICE because it can also handle datasets with up to 20% missing data.It creates multiple imputations for missing values by modeling each variable with missing data based on other variables in the dataset [27,28].This method ensures that the imputed values are as accurate as possible, maintaining the dataset's integrity for analysis [29].
Using imputation was crucial to avoid biases and inaccuracies from incomplete data, ensuring our analyses were robust and reliable.

Response and Sustained Response Groups
In this study, we defined responses to bDMARDs into two states: remission and low disease activity (LDA) [30].The sustained response also refers to the maintenance of remission or LDA states for at least six months [6].Based on the EULAR criteria, we used DAS28 score based on the erythrocyte sedimentation rate (DAS28-ESR) score to quantify remission and LDA.A DAS28-ESR score below 2.6 indicates remission, while a score between 2.6 and 3.2 indicates LDA [31].
In evaluating the effectiveness of bDMARDs treatment, patients were categorized based on their response and sustained response to the treatment.To assess the response, patients were labeled as "responders" if they met the remission or LDA criteria after 6 months and if they could maintain these criteria for an additional six months, requiring at least two visits within this period to be considered part of the sustained responder group.Patients who did not meet the responder criteria were categorized as "non-sustained responders", regardless of their DAS28-ESR values within the second six months.

Predictive Classification Models
In order to predict patient response and sustained response to bDMARDs treatment, we employed two separate machine learning models, as illustrated in Figure 1.Both models utilize the same clinical data to predict the patient's response to treatment at the sixmonth follow-up and estimate their sustained response to treatment after twelve months, respectively.For both models, we provide information regarding the most important clinical features influencing the model's outcome.We evaluated multiple classification models to identify the most effective approach.Five machine learning classifiers were trained with the selected features: Support Vector Machine (SVM) [32], Random Forest [33], Extreme Gradient Boosting (XGBoost) [34], Adaptive Boosting (AdaBoost) [35] and Knearest neighbor (KNN) [36].

Feature Importance and Interpretability
To make our models more reliable and easier to understand, we found and selected the most important features.We used a technique called Random Forest (RF) to figure out which features were crucial [37].Then, we ranked them based on their importance.Nevertheless, the RF algorithm alone does not provide information regarding the direction in which these variables influence the outcome of predictions.To address the problem, we used Shapley additive explanations (SHAP) [38,39].SHAP computed the difference in model output with and without each feature.This resulted in a SHapley value for each feature, which not only indicated its contribution but also the direction of its effect on the model's predictions.This information is crucial for clinicians to identify the best variables that significantly influence the prediction of response and sustained response in bDMARDs treatment.

Model Selection
We used a nested cross-validation methodology for training, validating and testing prediction models [40].This technique guarantees our models' robustness and best performance when processing real-world clinical data, which is essential for healthcare applications.
The nested cross-validation process consists of two main parts: an outer loop and an inner loop.In the framework of the outer loop, we first split our dataset into two main segments: an 80% training set and a 20% test set.
The training data were subsequently split into five different subsets, or "folds", within the inner loop.Here, we used hyperparameter tuning and feature selection.Using four folds for training and one for validation, we explored different combinations of features and hyperparameters in each iteration.After choosing the features, we used a grid search for each classifier, assessing every possible combination of hyperparameters.The search for the selection of the best hyperparameters was based on achieving the maximum accuracy on the validation dataset.This procedure was iterated five times, ultimately identifying the most effective hyperparameters and features for classifiers.Table A4 provides a list of the hyperparameters and the search space for each classifier.
In the second stage, we trained and evaluated our classifiers within the outer loop of the nested cross-validation method, utilizing the chosen features and the best hyperparameters identified in the first step.In this phase, we assessed performance metrics using the test dataset.
We repeated this method four times, using one of the remaining test folds as the testing dataset and the other as the training dataset in each iteration.For each new training dataset, the inner loop is repeated five times, resulting in a 5 × 5 cross-validation process.In addition, within the outer loop, we computed SHAP values for the features that were chosen [41].After concatenating these SHAP values, we were able to identify the most significant features in our models.Figure 2 presents a visual representation of this nested cross-validation process.
To evaluate the performance of the classifiers in predicting response and sustained response, we utilized four evaluation metrics: accuracy, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), Matthews Correlation Coefficient (MCC) and F1 score.The cut-off threshold for the ROC curves was set to 0.5.This threshold is commonly chosen as it assumes equal costs for false positives and false negatives.
Using the outer loop of nested cross-validation, we obtained five values of each metric for each classifier prediction.The best classifiers for response and sustained response prediction were selected based on a combination of the highest mean of the evaluation metrics and also taking into account standard deviation as an indicator of variability.This approach allowed us to prioritize classifiers with both strong average performance and relatively low variability.

Software
The machine learning models and analysis scripts used in this study were developed using Python 3.9.The code and libraries used in this study are available on GitHub.

Patients Characteristics
Among the 183 RA patients screened, 154 had at least one follow-up at six months from baseline and at least two follow-ups within the subsequent six months.The remaining 29 patients were excluded as they did not meet these criteria.Table A2 presents the baseline clinical features of these 154 RA patients, stratified by their response at six months.Additionally, Table A3 provides a stratification based on sustained response.
Out of 154 RA patients, 98 were identified as responders to the treatment at the sixmonth follow-up, and subsequently, 66 of these responders were recognized as having sustained response.The distribution of patients across these groups is illustrated in Figure 3, highlighting that while 64% of patients met the response criteria, only 43% were able to maintain a sustained response.

bDMARDs Response Prediction
We assessed five classifiers to predict patient response to bDMARDs treatment and classified them into two groups: responder and non-responder.Our dataset of 154 RA patients was divided into folds within the inner loop of the nested cross-validation process, with roughly 31 patients included for training and validation for each fold.This distribution meant that, for the outer loop, roughly 123 patients (80% of the total of 154), divided into five folds for the inner loop, were utilized for training in each of the five iterations, while the remaining 31 patients (20% of the total of 154), formed the test set.This made sure that every patient was fairly represented during the training and testing phases enabling thorough evaluation of the model's functionality.
From all classifiers, XGBoost outperformed the others, achieving the highest accuracy, AUC-ROC, MCC and F1 score.XGBoost demonstrated mean values of 0.851, 0.91, 0.714 and 0.878 for accuracy, AUC-ROC, MCC and F1 score, respectively (Table 1).Figure A1 shows the ROC curves for the five outer test folds of the classifiers.Additionally, Figure 4 displays the SHAP plots.The figure on the left side of Figure 4A displays the mean of absolute SHAP values, which shows the baseline features in descending order of importance for predicting the response to treatment using XGBoost.The key clinical features that are crucial for predicting treatment response after six months include DAS28-ESR, VAS for physician, HAQ score, VAS for pain assessment, Body Mass Index (BMI), VAS for patient, TJC28, age, CDAI score, SDAI score, ESR level, gender, SJC28, RF, csDMARDs taking, CRP level and NSAID usage, respectively.There is a positive correlation between lower values of DAS28-ESR and a higher probability of being classified as a responder after 6 months of therapy with bDMARDs.

bDMARDs Sustained Response Prediction
We also used five classifiers to predict sustained response to bDMARDs treatment.The assessment metrics presented in Table 2 reveal that the AdaBoost classifier achieved the highest mean accuracy, AUC-ROC, MCC and F1-Score for distinguishing patients based on their sustained response.The mean values for AdaBoost were 0.856, 0.842, 0.68 and 0.759 for accuracy, AUC-ROC, MCC and F1-Score, respectively.Figure A2 shows the ROC curves for the classifiers across the five outer folds of the nested cross-validation.Figure 4B presents the SHAP plots for the AdaBoost classifier, highlighting the significant baseline features and their impact on sustained responses.The primary features for predicting sustained response are listed in descending order of importance, including the DAS28-ESR Score, HAQ Score, SDAI Score, VAS pain, age, VAS for patients, BMI, TJC28, RF, CDAI Score, DAS28-CRP score, CRP level, ESR level, VAS for physicians, asthma status, csDMARD usage, RF and gender.The last four attributes have a negligible impact.As previously stated, the SHAP method enhances our comprehension of how features influence outcomes and provides a clearer indication of the most influential features by analyzing the distribution of red and blue dots.For example, patients with lower DAS28-ESR and HAQ scores at the start of treatment have a higher likelihood of achieving sustained response.

Discussion
Our study utilized routine clinical data and machine learning methods to predict the response to bDMARDs treatment at six months and the sustained treatment response after twelve months.We achieved high predictive accuracy, with the XGBoost classifier showing an AUC-ROC of 0.910 for initial response prediction and the AdaBoost classifier demonstrating an AUC-ROC of 0.842 for sustained response.The robustness of our models was ensured through nested cross-validation, with SHAP values providing insights into feature importance and directionality.
Using baseline data, we can predict patient outcomes before starting the medication, allowing for more accurate and efficient care by quickly identifying patients unlikely to benefit from these therapies.This approach optimizes follow-up schedules, improves overall treatment effectiveness and better allocates healthcare resources.
Our study addresses gaps in previous research by aligning endpoints with EULAR standards and focusing on sustained response, thus enhancing the practicality and applicability of our findings.The emphasis on baseline characteristics provides clinicians with valuable guidance for making informed treatment decisions before prescribing medication.This targeted approach is crucial for effective patient management in rheumatology.
Despite these advances, our study has certain limitations.Its single-center nature may limit the generalizability of the findings.Future research should validate these predictive models in diverse settings and among larger patient groups to confirm their broader applicability and strengthen their predictive power.Additionally, expanding the scope of our models to include a wider range of DMARDs could significantly enhance the ability to tailor treatment plans.By extending the predictive capabilities to various DMARD categories, we aim to provide clinicians with a comprehensive decision-making tool, ensuring each patient receives the most suitable and effective treatment.
In conclusion, our study presents a promising approach for predicting bDMARDs response and sustained response using routine clinical data.By highlighting the most critical features and explaining how each clinical feature can influence response and sustained response, our predictive models can serve as decision support tools to help rheumatologists make more informed decisions when prescribing bDMARDs.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Friedrich-Alexander University (FAU) (protocol code 334-18 B and 333-16 B, were approved on 9 October 2018).
Informed Consent Statement: Patient consent was waived as there was no intervention other than routine clinical data collection and patients were anonymized.

Figure 1 .
Figure 1.Data-processing flowchart showing patient selection, labeling strategy and delivery to the respective prediction model-selection units.

Figure 2 .
Figure 2. Flowchart illustrating the nested cross-validation process for feature selection, hyperparameter tuning and model selection.The nested cross-validation consists of five outer loops and five inner loops.In the inner loop, the optimal combination of features and hyperparameters is determined.During each round of the outer cross-validation, the SHAP values of the selected features are calculated on the test folds (Image adapted from[42]).

Figure 3 .
Figure 3.The proportion of responders and non-responders (after 6 months), and sustained responders and non-sustained responders (after 12 months).

Figure 4 .
Figure 4.The figure shows the most effective baseline features and their impact in predicting responses (A) using the XGBoost classifier, as well as the sustained response (B) through the implementation of AdaBoost classifiers.The SHAP values for the chosen features, which were obtained by the Random Forest (RF) model, are shown for predicting both the response and the sustained response.The figures on the left side show the average of absolute SHAP values and the mean influence of these features on the model.In contrast, the figures on the right side illustrate SHAP values, where positive values indicate a higher likelihood of response and sustained response, while negative values suggest the opposite.Vice versa, smaller SHAP values correspond to a lower probability of response and sustained response.The model for each patient is visually represented by a dot, where dots with greater SHAP values are shown in red and dots with lower values are represented in blue.For binary variables such as gender and diseases (e.g., male/female and presence/absence of disease, respectively), red dots indicate the presence of the condition or female gender, while blue dots represent the absence of the condition or male gender.The y-axis shows the most significant clinical features at the baseline for making these predictions.The right plot visually represents the direction of each mentioned baseline feature using a dot distribution.Red dots indicate greater values, while blue dots indicate lower

100 100 gamma [ 1 × 4 Figure A1 .Figure A2 .
Figure A1.ROC Curves for Nested Cross-Validation: Each curve represents the ROC for a test fold in the outer loop of the nested cross-validation process used for predicting response at six months.

Table 1 .
Results of predictive classifiers of response.

Table 2 .
Results of predictive classifiers of sustained response.
Note: Mean (±SD) evaluation scores of classifiers that classify the RA patients into two groups (Responder and Non-responder to bDMARDs).AdaBoost is the best classifier for classifying responses.

Table A2 .
Baseline Clinical Features of RA Patients, Stratified by Response to bDMARDs After 6 Months.
Mean (±standard deviation) and percentage of the population for each variable at baseline for RA patients treated with bDMARDs.p-values indicate the statistical significance of differences between responders and non-responders.

Table A3 .
Baseline Clinical Features of RA Patients, Stratified by Sustained Response to bDMARDs After 12 Months.
Mean (±standard deviation) and percentage of the population for each variable at baseline for RA patients treated with bDMARDs.p-values indicate the statistical significance of differences between sustained responders and non-sustained responders.

Table A4 .
Configuration space and the best hyper-parameters of each classifier.