Assessing the relevance of mental health factors in fibromyalgia severity: A data-driven case study using explainable AI

Introduction
Fibromyalgia (FM) syndrome is a chronic, painful, noninflammatory rheumatic disease with a high negative impact on quality of life [1]. While sensitized pain mechanisms are a key factor in FM, mental health and psychological factors cannot be ignored, as patients often experience depression, pain catastrophizing, and kinesiophobia [2].
Artificial Intelligence and Machine Learning (AI/ML) have shown promise in supporting clinical decision-making for various diseases, but the interpretability of these models remains a challenge hindering their adoption in clinical practice [3]. The emerging field of eXplainable AI (XAI) aims to address this issue by providing clinicians with understandable and data-driven decisions to support personalized and trustworthy diagnoses and treatments. XAI can also help clinicians, patients, health regulators, and developers work together in the co-development of AI clinical decision support systems [4].
AI/ML has been applied to FM to aid in diagnosis and to identify subgroups of FM patients using classification algorithms [5,6]. These algorithms utilize various data sources, including neuroimaging, psychological and social variables, and self-reported patient data [5,7,8]. Recent studies have incorporated XAI techniques to analyze feature relevance in the model outputs [9,10]. While psychological variables (anxiety, depression, psychological trauma) have been shown to be significant predictors of FM severity, the contribution of pain perception versus mental health status has not been fully explored [10,11]. A comparison of the importance of these variables could lead to a new perspective on FM treatment and generate discussion within the FM community, with implications for how the disease is treated.
One of our previous studies found a moderate direct association between pain catastrophizing and the size of referred pain areas, suggesting that those with higher negative expectations about pain may experience higher pain intensity [12]. Following a multidisciplinary discussion between the clinicians and data scientists of the research team, a data-driven approach based on a co-development process was proposed to assess the influence of mental health variables on FM. Clinical experts first underwent basic training in ML/AI in order to set research goals relevant to the clinical association between mental health and FM severity. Regulatory aspects such as the study protocol description, ethical approval, and registration in ClinicalTrials.gov were addressed before data collection. An exploratory data analysis was conducted by data scientists and clinicians to refine the research objectives and select target outcomes. Finally, an explainable FM classification model was developed through multiple iterations and validated against the current clinical expertise in FM provided by the team's clinicians. This research aims to develop an explainable ML classifier for different FM severity categories and to explore the association of severity with mental health and perceived pain variables. Data-driven findings will be validated with clinicians based on current FM treatment practices. The study adheres to the standardized practices of reporting prediction models in medicine, adopting the TRIPOD statement guideline with its 22-item checklist (Appendix A Table A.1) [13] along with the IJMEDINF guidelines [14] as supplementary material.
The remainder of this research paper is organized as follows: Section 2 describes the data and methods used in this study, including AI/ML algorithms, metrics, and XAI techniques. Section 3 presents the results of the study, including classification performance and explanations obtained through XAI techniques. Sections 4 and 5 contain the discussion and conclusions, respectively.

Study data acquisition
A total of 166 participants were recruited from a local Spanish FM association with the following inclusion criteria: FM diagnosed according to the American College of Rheumatology [15], age above 18 years, and proper understanding of spoken and written Spanish. The exclusion criteria were a diagnosed psychiatric disorder or an uncontrolled rheumatic pathology. Participants received study protocol information before giving informed consent. The observational cross-sectional study, which adhered to the Declaration of Helsinki and the STROBE statement [16], was approved by the Ethics Committee for clinical research in the Talavera de la Reina health area (Spain) (25/2021) and conducted from October 2021 to February 2022.
The demographic and clinical data recorded at baseline were age, sex, study level, marital status, job situation, weight, height, and years of FM disease. Pain intensity was quantified on a 0-100 visual analog scale (PainVAS), where 0 represents "complete absence of pain" and 100 "the worst pain imaginable". The referred pain area (PainPIXEL) was reported by the patient, after a pressure algometer was applied to the infraspinatus muscle [12], through the digital application Navigate Pain [17] in terms of the number of pixels marked. The McGill Pain Index (PainMcGill) [18] was also used to evaluate the extent of the pain suffered by the participants. Mental health variables, including anxiety, depression, and pain catastrophizing, were measured using the State-Trait Anxiety Inventory [19], the Beck Depression Inventory-II (BDI) test [20], and the Pain Catastrophizing Scale [21], respectively. Muscle strength was evaluated using a dynamometer following the guidelines proposed by Ruiz et al. [22].

Pre-processing, machine learning algorithms, and metrics
The authors of the study used their framework SCI-XAI [27] to obtain the optimal FM severity classifier model in terms of performance and number of selected features. The framework first splits the dataset into development and test sets with a 70/30 ratio, stratified by the target feature, so that model performance is evaluated with unseen data and data leakage during model training is avoided.
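As an illustration of the stratified 70/30 split, the following minimal Python sketch allocates indices class by class. It is not the SCI-XAI implementation: the function name `stratified_split`, the seed, and the per-class rounding rule are assumptions made for this example. The class counts are the PDS counts reported in the Discussion (112 very severe, 26 severe).

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (dev_indices, test_indices), keeping class proportions in both sets."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    dev, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: each class contributes round(test_ratio * class size) test cases.
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        dev.extend(indices[n_test:])
    return sorted(dev), sorted(test)


# PDS class counts reported in the paper: 112 very severe, 26 severe.
labels = ["very severe"] * 112 + ["severe"] * 26
dev_idx, test_idx = stratified_split(labels)
dev_counts = Counter(labels[i] for i in dev_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(dev_counts, test_counts)
```

Under these assumptions the per-class sizes coincide with the 78/18 (development) and 34/8 (test) allocation reported for the PDS model, although which participants fall in each set depends on the random seed.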
Prior to fitting the ML algorithm on the development set, the SCI-XAI framework performs an exhaustive search across three sets of hyperparameters within a 5-fold cross-validation approach, using the brute-force algorithm provided by the GridSearchCV method of the Python scikit-learn library. A graphical diagram illustrating this framework has been included as supplementary material (Appendix B Fig. B.1). This algorithm generates multiple combinations of hyperparameter values and selects the optimal combination based on a specific criterion, in this case the highest AUC-ROC (area under the receiver operating characteristic curve). First, imputation strategies are considered, involving the use of the mean or median for numerical features and the most frequent value for categorical features. Subsequently, the feature selection step is applied through different techniques depending on the feature type: ANOVA for numerical features, Chi-squared for categorical features, and Mutual Information and Recursive Feature Elimination for both types [28]. To determine the optimal number of features, the brute-force algorithm iteratively applies each of these techniques within a range from 1 to 15 features, corresponding to the total number of features in the dataset. Next, the cross-validation approach is applied to the development set by fitting several classification algorithms: logistic regression, decision tree, random forest and its balanced version, extra trees, adaptive boosting (AdaBoost), and random undersampling adaptive boosting (RUSBoost). The models were trained considering the class weights because of the imbalance among the different FM severity classes. Despite their initial opaqueness, ensemble tree algorithms were included because they allow an additional XAI technique based on the Mean Decrease Impurity (MDI) to be applied alongside other common post-hoc techniques, as described in Section 2.3.
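The brute-force search described above can be sketched in plain Python as follows. This is an illustrative stand-in for scikit-learn's GridSearchCV, not the SCI-XAI code: the option names in `search_space` are placeholders, the three searched dimensions are assumed to be imputation strategy, selection technique, and number of features, and `toy_fold_scores` returns made-up numbers in place of real cross-validated AUC-ROC values.

```python
from itertools import product
from statistics import mean

# Illustrative search space; the option names are placeholders, not SCI-XAI identifiers.
search_space = {
    "imputer": ["mean", "median"],
    "selector": ["anova", "chi2", "mutual_info", "rfe"],
    "n_features": list(range(1, 16)),  # 1 to 15 features
}


def grid_search(space, fold_scores_fn):
    """Evaluate every combination and return the one with the highest mean fold score."""
    names = list(space)
    best_params, best_score = None, float("-inf")
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(fold_scores_fn(params))  # e.g. AUC-ROC on each of the 5 folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_fold_scores(params):
    """Placeholder scorer with made-up values; a real one would fit and score a model."""
    base = 0.80 - 0.01 * abs(params["n_features"] - 6)
    bonus = 0.01 if params["imputer"] == "median" else 0.0
    return [base + bonus] * 5  # one score per fold


n_combinations = len(list(product(*search_space.values())))
best_params, best_score = grid_search(search_space, toy_fold_scores)
print(n_combinations, best_params)
```

The sketch makes the cost of the exhaustive approach visible: every added option multiplies the number of candidate configurations, and each candidate is fitted once per fold and per classification algorithm.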
The classification metrics used to evaluate model performance were accuracy, balanced accuracy, sensitivity, precision, F1 score, the area under the receiver operating characteristic curve (auc_roc), the Matthews coefficient, and the Kappa score. To assess the statistical significance of the training results provided by the classifiers, we employed the Friedman test with an alpha level of 0.05 [29]. Because of the imbalance in the target features, class weights were considered when building the models, and the best-performing model among those evaluated was determined by the auc_roc.
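For the binary case, most of these metrics follow directly from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions and hypothetical counts; it is not taken from the study's code, and it assumes that the Kappa score refers to Cohen's kappa. It does not guard against empty classes, which would cause a division by zero.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Compute common classification metrics from binary confusion-matrix counts."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, computed from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=6, fn=2, fp=1, tn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy, the Matthews coefficient, and kappa all correct in different ways for the majority class, which is why they are informative alongside plain accuracy when the classes are imbalanced.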

XAI techniques
Balancing accuracy and explainability is crucial for the adoption of AI models in healthcare. Machine learning (ML) algorithms can be transparent or black box, with transparent models (e.g., logistic regression, decision trees) being interpretable by design, and black box models (e.g., ensemble trees, neural networks) requiring post-hoc techniques for explanation. This research considers post-hoc techniques, including SHapley Additive exPlanations (SHAP) and Partial Dependence Plot (PDP), for explaining model decisions. SHAP calculates an additive measure called Shapley values for each feature of a single instance, providing local and global explanations [30]. PDP calculates the marginal effect of a given feature on the predicted outcome over its range of observed values, depicting its influence on global prediction probability [31]. For ensemble tree algorithms, Mean Decrease Impurity (MDI) shows the relevance of features based on the number of times a feature is used to split a tree classifier's node weighted by the number of samples within the split.
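The additive property of Shapley values can be illustrated with an exact computation on a small example. The sketch below is a simplification and not the SHAP library: absent features are replaced by a single baseline value rather than averaged over a background dataset, and `toy_model`, the instance, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(instance)

    def value(subset):
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(x):
    """Made-up scoring function with one interaction term (not the study's model)."""
    anxiety, depression, pain = x
    return 0.02 * anxiety + 0.03 * depression + 0.001 * anxiety * pain


instance = [40, 20, 50]  # hypothetical participant
baseline = [20, 10, 30]  # hypothetical reference values
phi = shapley_values(toy_model, instance, baseline)
print(phi, sum(phi))
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity that lets SHAP explain a single prediction feature by feature. The exact enumeration grows exponentially with the number of features, so practical tools rely on approximations or model-specific algorithms.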

Dataset description
Table 1 displays the descriptive statistics of the clinical study data, revealing an imbalance in the target outcomes (PDS and FIQ). The authors carefully considered this imbalance when training and evaluating models, selecting hyperparameters based on the different classes' weights.
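One common way to derive class weights from an imbalanced target is the inverse-frequency heuristic sketched below. Whether the study used exactly this formula is an assumption; the sketch only shows how the reported PDS class counts (112 very severe, 26 severe) would translate into weights under that heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# PDS class counts reported in the paper: 112 very severe, 26 severe.
pds_labels = ["very severe"] * 112 + ["severe"] * 26
weights = balanced_class_weights(pds_labels)
print(weights)
```

With this weighting, errors on the minority class cost more during training, so the classifier is discouraged from always predicting the majority class.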

FM severity classification with PDS as target feature
The authors developed a classification model to categorize FM severity with the PDS scale as the target outcome, using 15 features and excluding the FIQ score since it measures the same effect as PDS. PDS information was not recorded for 28 cases, so they were excluded. The initial stratified split allocated 78/18 cases to the very severe/severe categories in the development set and 34/8 in the test set. The balanced random forest achieved the highest auc_roc of 0.81 (std 0.07) when training the model with the development set, as shown in Table 2. The Friedman test indicated statistically significant results (p ≪ 0.001). The optimal model's performance was evaluated using unseen data (test set) and the features selected by the SCI-XAI framework, as depicted in Table 5.a. The explainability of the optimal model was analyzed through the post-hoc techniques SHAP (Fig. 1) and PDP (Fig. 2). The MDI was also used, given the ensemble tree type of the optimal model, and yielded the following feature importance scores: anxiety (0.266), depression (0.218), job situation (0.206), study level (0.191), painMcGill (0.1037), and sex (0.015), listed in descending order.

FM severity classification with FIQ as the target feature
Due to the strong imbalance in FIQ categories (9 mild, 42 moderate, and 115 severe cases), two approaches were adopted for FM severity classification: withdrawing the 9 participants with mild severity or considering all FIQ cases. Thus, the first model entailed a binary classification between the severe/moderate FM categories, with the stratified split allocating 80/29 cases to the development set and 35/13 to the test set. Table 3 shows the development classification results for this prediction model, where logistic regression achieved the best auc_roc: 0.83 (std 0.08). The training set results were statistically significant according to the Friedman test (p ≪ 0.001). This optimal model's performance with the test set and the features selected are shown in Table 5.b. As a transparent algorithm, logistic regression assigns a coefficient (odds ratio) to each feature, reflecting its relevance to the given output, as follows: anxiety (8.104), catast (6.081), depression (4.810), painPIXEL (3.341), painVAS (3.120), sex (2.1923), painMcGill (2.1893), job situation (1.230).
The second FIQ approach encompassed a multiclass FM severity classification (mild, moderate, and severe) where the class weights are considered in the classification metrics, e.g., a one-versus-rest (OVR) approach is used for auc_roc. The stratified split allocated the severe/moderate/mild classes as follows: 81/29/6 cases in the development set and 34/13/3 cases in the test set. Tables 4 and 5.c show, respectively, the random forest classifier yielding the best performance with the development set and 8 features selected, and the performance with the test set. The results shown in Table 4 contain statistically significant differences according to the Friedman test (p ≪ 0.001). Additionally, Fig. 3 displays the optimal model's confusion matrices with the development set (Fig. 3.a) and the test set (Fig. 3.b). The explainability analysis of the random forest for the FIQ multiclass approach was conducted through PDP plots (Fig. 4), to determine the relative increasing or decreasing trend of the probability for each class, and through MDI, which yielded the following relevance of the input features: anxiety (0.202), depression (0.176), catast (0.148), painMcGill (0.146), painVAS (0.143), painPIXEL (0.080), age (0.092), marital status (0.013).
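The one-versus-rest averaging used for the multiclass AUC can be sketched as follows. This is a plain-Python illustration with hypothetical labels and predicted probabilities, not the study's evaluation code; whether the per-class AUCs were combined by prevalence weighting or by a simple mean is an assumption exposed through the `weighted` flag.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def ovr_auc(y_true, probabilities, classes, weighted=True):
    """One-versus-rest AUC averaged over classes, optionally weighted by prevalence."""
    total, weight_sum = 0.0, 0.0
    for k, label in enumerate(classes):
        is_positive = [y == label for y in y_true]
        weight = sum(is_positive) if weighted else 1.0
        class_scores = [row[k] for row in probabilities]
        total += weight * binary_auc(class_scores, is_positive)
        weight_sum += weight
    return total / weight_sum


# Hypothetical labels and predicted probabilities (columns: mild, moderate, severe).
classes = ["mild", "moderate", "severe"]
y_true = ["severe", "severe", "severe", "moderate", "moderate", "mild"]
probabilities = [
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
    [0.5, 0.3, 0.2],
]
macro_auc = ovr_auc(y_true, probabilities, classes, weighted=False)
weighted_auc = ovr_auc(y_true, probabilities, classes, weighted=True)
print(macro_auc, weighted_auc)
```

Because each class is scored against all the others, a dominant class can pull a prevalence-weighted average upward even when a rare class is poorly separated, so the confusion matrix remains a necessary complement to this metric.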
* Feature with missing data; † Target outcomes.

Discussion
Assessing mental health's impact on FM severity could improve treatment by providing psychological support. Currently, international guidelines prioritize non-pharmacological therapies for FM, with medication as a supplement [35]. This study offers a data-driven case study where mental factors are observed to have a stronger association with FM severity than pain factors. This finding has the potential to underscore the importance of studying non-pharmacological interventions and advocating for mental well-being support for FM patients.
The best performance in FM severity classification through the PDS scale was achieved by the balanced random forest, with an auc_roc of 0.81 and 0.66 on the development and test sets, respectively. The overfitting may be explained by the small dataset size (166 participants) and the class imbalance (26 severe and 112 very severe). The SCI-XAI framework allowed for feature optimization, removing 9 of the 15 features that did not contribute to classification performance or explainability analysis. This step improved the model's interpretability by eliminating uninformative features, which do not enhance the classification performance. Mental health factors (anxiety and depression) were found to be more relevant than pain variables (only the McGill Pain Index) in the feature selection phase. This was supported by the MDI explainability analysis, which highlighted anxiety and depression as the most significant features for very severe FM, followed by job situation and study level. SHAP analysis also identified anxiety and depression as the second and third most relevant features (Fig. 1), while 'unemployed' in the job situation variable was the top feature associated with an increased probability of very severe FM. Major depression and fibromyalgia frequently co-occur, contributing to lower employability, according to Kassam et al. [32] and Liedberg et al. [33], who also found that unemployment is associated with higher FM severity. Conversely, SHAP identified a university degree education level as the least relevant to the very severe FM level, which Pérez-Aranda et al. attribute, rather than to cultural level, to the physically demanding jobs held by people with less education, which trigger the central nervous system dysregulation responsible for FM symptomatology [11]. The PDP in Fig. 2 shows an increase of up to 20 % in probability contribution for anxiety values over 30, similar to that for depression values over 10. Thus, BDI cut-off values (10-18 for mild to moderate, 19-29 for moderate to severe, and 30-63 for severe depression) can associate moderate to severe depression levels with very severe FM, according to Aaron et al. [34].
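To make the reading of such a partial dependence curve concrete, the following sketch computes one by brute force: the feature of interest is forced to each grid value for every row, and the predictions are averaged. The scoring rule `toy_model`, the rows, and the grid are synthetic and chosen only to produce a visible step; they are not the fitted classifier or the study data.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the feature of interest
            predictions.append(predict(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_model(row):
    """Made-up scoring rule, not the fitted classifier from the study."""
    anxiety, depression = row
    return min(1.0, 0.3 + (0.2 if anxiety > 30 else 0.0) + 0.01 * depression)


rows = [[25, 8], [35, 12], [45, 20], [15, 5]]  # synthetic (anxiety, depression) pairs
grid = [10, 20, 30, 40, 50]  # anxiety values at which the curve is evaluated
curve = partial_dependence(toy_model, rows, 0, grid)
print(list(zip(grid, curve)))
```

A jump in the curve between two grid values, as between 30 and 40 in this synthetic case, is what is read as a threshold above which the feature raises the predicted probability. The curve describes an average over the dataset and does not by itself establish a causal effect.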
Logistic regression achieved the best performance, with a fairly robust auc_roc of 0.83 and 0.92 on the development and test sets, respectively, for binary FM severity classification using FIQ. Odds ratio coefficients showed mental health variables (anxiety, pain catastrophizing, and depression) dominating over pain variables (referred pain location area, pain intensity VAS, and McGill index) in discriminating severe from moderate FM cases. For the FIQ multiclass classification, performance results were satisfactory, with an auc_roc of 0.91 and 0.86 for the development and test sets, respectively. However, the high imbalance in the mild and moderate FM classes reduced the balanced accuracy metric (0.53 with the test set), as seen in the confusion matrix, where all mild FM instances were misclassified and 8 out of 13 moderate FM instances were correctly classified. In the multiclass approach to FM severity classification, the random forest algorithm yielded the best performance. The MDI explainability technique showed that mental health variables were, again, more relevant than pain variables for classifying severe FM cases. This finding was supported by the PDP analysis, which revealed that anxiety, depression, and pain catastrophizing had a probability contribution greater than 0.20, compared to the McGill index, pain severity visual analog scale, and referred pain location area, which had a minor contribution. Similar to the results obtained with the PDS as the target feature, moderate FM cases were associated with anxiety values between 15 and 30, and severe FM cases with anxiety values above 30. Depression values above 10 on the BDI scale were also indicative of severe FM cases. In the multiclass approach using PDP, all variables except anxiety consistently showed higher probabilities for severe FM cases, which might be due to the high prevalence of severe cases in the dataset, resulting in a one-versus-rest approach (i.e., severe versus moderate and mild).
As a general contribution and through a data-driven case study, this research paper goes beyond the idea that mental health factors are associated with increasing FM severity. To the best of our knowledge, no other research studies have reported that the relevance of psychological factors is higher than that of pain factors. This finding could draw clinicians' attention when determining FM treatments. Thus, the result of this study might be relevant to support the viability of non-pharmacological treatment of FM severity, such as promoting the mental health of patients, following the recommendations of international FM guidelines. The results obtained were shared with clinicians to be validated against their clinical expertise. Not only did clinical experts value these results as potentially applicable to clinical routine, but they also recalled some patients' comments about perceiving less pain when their anxiety and depression were at low levels. However, this statement should be approached with caution, as the causality between mental factors and FM severity has not been studied in this work. Therefore, we propose to further investigate the association between FM severity and mental factors, exploring their causal relationship in future research. In addition, the clinical study will be extended by developing an AI/ML model for FM severity prognosis to gain data-driven evidence on the evolution of disease severity under non-pharmacological treatment.
The study's main limitation arises from the imbalance in the target features, both with PDS and FIQ, resulting in overfitting and in misclassification in the multiclass approach with FIQ. To mitigate the impact of this imbalance, we considered the class weights during the training phase. Nevertheless, the distribution of FM severity categories aligns with other studies that use FIQ to evaluate the severity of the disease [26,35]. In addition, the high imbalance in the FIQ multiclass approach, with 9 mild cases (6 and 3 cases in the development and test sets, respectively), constrained the number of folds in the cross-validation to 5, so that every fold contained instances of all classes despite the small dataset. These issues could be mitigated by enrolling more participants while maintaining the category distribution, to guarantee robust performance in real clinical practice. Using data resampling techniques such as SMOTE or Tomek links is not recommended in this context because it could alter the distribution of categories in the FM population, potentially compromising the model's performance in a real clinical setting. Missing data were addressed through imputation or by discarding samples, for instance those participants with no PDS values, further reducing the dataset's size. These limitations may have impacted the model performance. Despite these limitations, the study's results support the significance of mental health in FM severity, which may guide clinicians in determining FM treatments.

Table 2
Cross-validation results of prediction model for FM severity by using PDS scale (the optimal model concerning the auc_roc metric is highlighted in bold). Acc: Accuracy, Bal. Acc.: Balanced Accuracy, Sens: Sensitivity, Prec: Precision, AUC_ROC: Area under curve receiver operating characteristic, MCC: Matthews Coefficient.

Conclusion
Mental health is considered a significant indicator of FM severity; however, pharmacological interventions are currently the dominant therapies. FM clinical experts ask how psychological factors compare with pain factors in order to reorient treatment approaches from pain reduction toward improving mental wellbeing.
AI/ML nowadays has a significant impact on the healthcare field through the development of clinical decision support systems (DSS). XAI has emerged to address interpretability issues with AI models, especially in sensitive environments such as healthcare, where understanding the model's output is crucial for correct and safe decisions concerning the users affected, in this case, patients.
This study used a data-driven approach to examine the influence of mental health variables on FM severity. FM clinicians and data scientists collaborated to develop three classifier models for FM severity using different scales, with strong performance (auc = 0.81 with the PDS scale, and auc = 0.83 and 0.91 with the FIQ scale). The explainability analysis suggested that mental health factors and other anxiety-related situations (such as unemployment) had a stronger association with the severe and very severe FM categories than pain factors. These findings were validated by clinicians according to current FM clinical evidence, which supports non-pharmacological interventions to decrease FM severity, such as reducing psychological distress or improving mental wellbeing.

Statements of ethical approval
The clinical study associated with this research was approved by the Ethics Committee for clinical research in the Talavera de la Reina health area (Spain) (25/2021). The clinical study is registered in ClinicalTrials.gov (NCT04918602).

Fig. 1. SHAP summary dot plot of FM severity classification model by using PDS scale. Each dot represents the positive or negative attribution (calculated as Shapley values) to the probability of being classified in the FM very severe class according to PDS for the different features of every participant. The dot color denotes the value of the feature.

Fig. 2. PDP plots of FM severity classification model by using PDS scale.For each feature, the vertical axis represents the quantitative contribution in the probability of being classified in FM very severe class according to PDS.The horizontal axis represents the feature's range of values contained in the dataset.

Fig. 3. Confusion matrix of prediction model of FM severity by using FIQ scale with train set (a) and test set (b).

Fig. 4. PDP plots of prediction model of FM severity classes by using FIQ scale.For each feature, the vertical axis represents the quantitative contribution in the probability of being classified in the different categories (depending on the color line) of FM severity according to FIQ class.The horizontal axis represents the feature's range of values contained in the dataset.

Table 1
Clinical study dataset description.

Table 3
Cross-validation results of prediction model of FM severity by using FIQ with moderate and severe categories. Acc: Accuracy, Bal. Acc.: Balanced Accuracy, Sens: Sensitivity, Prec: Precision, AUC_ROC: Area under curve receiver operating characteristic, MCC: Matthews Coefficient.

Table 4
Cross-validation results of prediction model of FM severity by using FIQ with the 3 FM severity categories. Acc: Accuracy, Bal. Acc.: Balanced Accuracy, Sens: Sensitivity, Prec: Precision, AUC_ROC: Area under curve receiver operating characteristic, MCC: Matthews Coefficient.

Table 5
Test set results and features selected of prediction model for FM severity by using PDS scale (a); FIQ with moderate and severe categories (b); and FIQ with the 3 FM severity categories (c). Acc: Accuracy, Bal. Acc.: Balanced Accuracy, Sens: Sensitivity, Prec: Precision, AUC_ROC: Area under curve receiver operating characteristic, MCC: Matthews Coefficient.
Features selected: anxiety, depression, pain catastrophizing scale, McGill pain index, pain severity visual analog scale, referred pain location area, age, marital status