Factors influencing psychological distress among breast cancer survivors using machine learning techniques

Breast cancer is the most commonly diagnosed cancer among women worldwide. Breast cancer patients experience significant distress relating to their diagnosis and treatment. Managing this distress is critical for improving the lifespan and quality of life of breast cancer survivors. This study aimed to assess the level of distress in breast cancer survivors and analyze the variables that significantly affect distress using machine learning techniques. A survey was conducted with 641 adult breast cancer patients using the National Comprehensive Cancer Network Distress Thermometer tool. Participants identified various factors that caused distress. Five machine learning models were used to predict the classification of patients into mild and severe distress groups. The survey results indicated that 57.7% of the participants experienced severe distress. The top-three best-performing models indicated that depression, dealing with a partner, housing, work/school, and fatigue are the primary indicators. Among the emotional problems, depression, fear, worry, loss of interest in regular activities, and nervousness were determined as significant predictive factors. Therefore, machine learning models can be effectively applied to determine various factors influencing distress in breast cancer patients who have completed primary treatment, thereby identifying breast cancer patients who are vulnerable to distress in clinical settings.


Study participants and data collection
A total of 641 adult breast cancer patients aged 19 and above, who were registered at the Cancer Survivor Integration Support Center from April 2020 to July 2022 and satisfied the selection criteria, participated in this study.The selection criteria for participants included women aged 19 to 64 with breast cancer, without any psychiatric issues such as depression or any history of recurrence or metastasis.After completing primary treatment, the breast cancer patients could voluntarily register at the Cancer Survivor Integration Support Center to participate in the cancer survivorship program.A survey was conducted on the level of distress and related factors for patients who agreed in writing to participate after a nurse at the center explained the purpose of the survey.An analysis was conducted using these data collected from the Cancer Survivor Integration Support Center, with specific data selected that met the criteria for participant selection.All personal information, including patient names and hospital identifications, was removed before being provided for analysis.This study was conducted after obtaining approval from the Institutional Review Board of the hospital with which the cancer center was affiliated prior to receiving the data.

Measures
The National Comprehensive Cancer Network (NCCN) Distress Thermometer is a widely used screening tool for assessing psychosocial distress in cancer patients.When using this tool, patients are asked to circle the number that best describes the amount of distress that they have experienced over the past week and to indicate whether any of the items on the specified problem list have caused problems.The thermometer itself is unspecific; however, the problem list identifies the multidimensional categories that cause distress.The Distress Thermometer consists of an 11-point visual analog scale ranging from 0 (no distress) to 10 (extreme distress) and a 39-item problem list 17 .The patients rate the level of distress that they have experienced over the past week using the visual analog scale.The established cutoff score for further screening is four 12,18 .Patients are then asked to fill in the problem list that accompanies the visual image of the distress thermometer to verify whether (yes/no) they have experienced any of the listed problems over the previous seven days to identify the factors related to the distress 10,17 .The NCCN recommends incorporating the problem list for patients as part of the assessment to assist the provider in identifying sources of patient distress.The problem list consists of 36 problems under the following five grouped categories: spiritual/religious, practical, family, emotional problems, and physical problems 12 .

Statistical analysis
The χ2 test (Fisher's exact test) and t-test were conducted using the Jamovi program (version 2.3.21) to evaluate the level of distress and survival rate of breast cancer patients according to the distress problem list and the difference between mild and severe distress groups.Training and testing of the machine learning models were performed and the feature importance was verified and visualized using Python version 3.9.16.The specific analysis steps are described in the following.

Data preprocessing
Data preprocessing involves organizing data before analyzing them and using them to train models 19 .This process includes handling missing values in the collected data and transforming them into the necessary format for data learning through encoding.In this study, no significant results were observed in the demographic and treatment-related characteristics; therefore, the items in the list of distress problems were processed as dummy variables.Thus, data preprocessing was performed using the Pandas and NumPy libraries in the Python package.

Model training
Five machine learning models that are representative supervised learning algorithms for binary classification-Logistic Regression, XGBoost, Random Forest, Support Vector Machine, and CatBoost-were applied.They were www.nature.com/scientificreports/used to identify the relationships between categories in existing data and autonomously determine the category of newly observed data 19 .This study aimed to derive a high-performing model for predicting mild and severe distress groups as per the feedback of breast cancer survivors.The data were randomly divided at the ratio of 7:3 into a training dataset for building the predictive model through learning and testing dataset for validating the built predictive model.The training dataset was divided into five subsets and k-fold cross-validation was performed to mitigate overfitting and improve the robustness.Moreover, a grid search was conducted to identify the optimal hyperparameters for each model.This task was performed using the train_test_split and GridSearchCV functions in scikit-learn, which is a representative machine learning package.

Model testing
Model testing is the process of evaluating the performance of a machine learning model constructed through training 19 .In this study, the accuracy, precision, recall, F1 score, and AUC were used as the performance metrics, where higher values indicate better predictive power of the model.Accuracy is the ratio of data that were correctly predicted in the mild and severe distress groups.Precision indicates the ratio of subjects who were actually experiencing severe distress among those who were predicted to be experiencing severe distress.Recall is the ratio of subjects predicted to be under severe distress among the subjects who were actually experiencing severe distress.The F1 score is the harmonic mean of the precision and recall and represents a high value when precision and recall are similar.Finally, the AUC score is an area, which is represented as a percentage.It indicates the effectiveness and generalization of the classification performance of the model 19 .This task was performed using the accuracy_score, precision_score, recall_score, f1_score, and roc_auc_score functions in scikit-learn.

Feature importance
Feature importance is a metric that indicates the impact of each feature on predictions during the training of the machine learning model 19 .In this study, the top-10 variables with high relative importance were extracted and visualized based on the coefficient using the feature_importances_ and coef_ functions in scikit-learn.

Ethical considerations
This study was performed in accordance with the principles of the Helsinki declaration, and the procedures were followed in accordance with institutional guidelines.The study was approved by the institutional review board of the Ajou University (AJOUIRB-DB-2022-305), and all patients gave written informed consent.

Characteristics according to the distress experienced by participants and distress problem list
The average distress score of the participants was 4.35 (± 2.38), with 271 participants (42.3%) recording mild distress scores of < 4 and 370 participants (57.7%) recording severe distress scores of ≥ 4 (Supplementary Table 1).

Relationships between distress groups according to the demographic and treatment-related characteristics of the participants
When considering the relationships between distress groups based on the demographic and treatment-related characteristics of the participants (Table 1), no statistically significant differences were observed between the two groups in any of the characteristics.

Relationships between distress groups according to the distress problem list of the participants
The relationships between distress groups according to the list of distress problems selected by the participants are listed in Table 2.In terms of real-life problems, among the participants who responded that they had practical problems relating to child care (χ2 = 16.50, p < 0.001), housing (χ2 = 16.30,p < 0.001), insurance/financial aspects (χ2 = 12.10, p < 0.001), and work/school (χ2 = 7.47, p = 0.006), a higher proportion experienced severe distress rather than mild distress.In terms of family problems, among participants who responded that they had problems relating to dealing with children (χ2 = 12.80, p < 0.001), dealing with a partner (χ2 = 30.90,p < 0.001), and family health issues (χ2 = 5.26, p = 0.022), a higher proportion were classified in the severe distress group.In terms of emotional problems, a higher proportion of participants in the severe distress group reported problems with depression (χ2 = 38.60,p < 0.001), fears (χ2 = 25.40,p < 0.001), nervousness (χ2 = 20.80,p < 0.001), sadness (χ2 = 17.30, p < 0.001), worry (χ2 = 22.50, p < 0.001), and loss of interest in usual activities (χ2 = 15.90, p < 0.001).
In terms of physical problems, a higher proportion of participants in the severe distress group reported problems with appearance (χ2 = 16.20,p < 0.001), eating (χ2 = 6.

Comparison of distress-predicting performance for participants using machine learning models
The results of the various machine learning models and a comparison of their performances when determining the factors influencing the distress of participants are listed in Table 3.The accuracy scores of the Support Vector Machine, XGBoost, and CatBoost models were similar, with a value of 0.715.In terms of precision, the Support Vector Machine exhibited the highest value of 0.810, followed by XGBoost with 0.792 and CatBoost with 0.787.Moreover, Random Forest achieved the highest performance in terms of recall (0.821), followed by CatBoost (0.726) and XGBoost (0.718).Similarly, Random Forest exhibited the highest performance in terms of F1 score (0.771), followed by CatBoost (0.756) and XGBoost (0.753).The area under the receiver operating characteristic curve (AUC) score indicates that Support Vector Machine (0.721), XGBoost (0.714), and CatBoost (0.712) achieved the best performances in descending order (Fig. 1).

Importance of the top-10 variables derived from the Support Vector Machine, XBGoost, and CatBoost models for predicting the distress group of breast cancer survivors
The Support Vector Machine model showed the highest predictive performance based on the AUC score.The importance of the variables was ranked in the following order: dealing with a partner, work/school, worry, housing, fears, depression, loss of interest in normal activities, sleep, dealing with children, and nervousness.The XGBoost algorithm, which is known for its superior predictive performance, ranked the variables in descending order of importance as follows: depression, housing, appearance, dealing with a partner, work/school, fears, fatigue, pain, loss of interest in usual activities, nervousness.CatBoost, which was ranked third in terms of performance, rated the variables in descending order of importance as follows: dealing with a partner, depression, housing, fears, fatigue, family health difficulties, appearance, work/school, sleep, and pain (Fig. 2).

Discussion
When the machine learning models were applied to identify the factors for predicting distress in breast cancer survivors, Support Vector Machine, XGBoost, and CatBoost demonstrated superior predictive performance in terms of the AUC score.The factors that were determined as significant predictors were depression among emotional problems, dealing with a partner, housing and work/school among practical problems, and fatigue among physical problems.
The results of this study showed that the distress level of breast cancer patients after completing primary treatment was an average of 4.35 points, and the proportion of participants who were classified under severe distress based on the score of 4 points, which is specified in the NCCN guidelines, was 57.7%.Distress in breast cancer patients occurs from the time of diagnosis until the completion of treatment, and it persists in approximately one-third to one-half of patients even after the completion of primary breast cancer treatment 8 .The time after the completion of treatment is crucial for the adaptation of cancer survivors.However, at this point, patients who have experienced distress may have a slow recovery process and persistent physical and psychological symptoms 20 .During this period, breast cancer patients who experience ongoing distress may not only experience delays in adaptation and recovery in their daily lives 21,22 , but also face difficulties in readjusting to societal requirements, such as returning to work, thereby resulting in a lower quality of life.However, note that many www.nature.com/scientificreports/breast cancer patients do not receive any specific management beyond regular hospital visits after their treatment is completed.Therefore, the level of distress in breast cancer patients when primary treatment has been completed should be assessed, and concrete and practical measures to alleviate it should be determined.The high-performance machine learning models Support Vector Machine, XGBoost, and CatBoost were applied to identify the predictive factors of distress in breast cancer survivors.The results showed that the accuracy, F1 score, and AUC score of these models were considerably high, exceeding 0.70.The Support Vector Machine, XGBoost, and CatBoost models have been reported to demonstrate superior predictive performance in multiple studies on the prediction of psychological symptoms such as distress [23][24][25] .Support Vector Machine is an algorithm that identifies the boundary with the largest margin by setting a hyperplane between the data, and it exhibits low overfitting and superior classification performance 26 .The XGBoost model is an ensemble model of decision trees that achieves fast learning and classification speeds using parallel processing.It also exhibits superior predictive performance in classification and regression 27 .Furthermore, the CatBoost model achieves high accuracy for categorical variables using ordered boosting 28 .This study demonstrates that, compared with the traditional binary classification method of logistic regression, machine learning models exhibit not only improved overall performance metrics but also offer a more intuitive understanding of the relationships between multiple variables and their feature importance.Particularly, the Support Vector Machine model demonstrated superior classification performance relative to ensemble models such as XGBoost and CatBoost.This is attributed to the parallel combination of single models in ensemble methods, which can lead to issues such as increased computational time and overfitting 19 .The Support Vector Machine, with its sequential learning process, mitigates these issues.Furthermore, it exhibits high generalization performance on new data and robustness to outliers 19 , making it highly beneficial for clinical settings where identifying and understanding a multitude of factors is crucial.Regarding the importance of variables in predicting distress among breast cancer survivors, the results of the Support Vector Machine, XGBoost, and CatBoost models indicated that emotional problems such as depression, fears, worry, loss of interest in usual activities, and nervousness are significant predictive factors.Emotional symptoms such as depression, fear, and anxiety are reported to occur in breast cancer patients in a complex and clustered manner 29 .These symptoms are observed from the time of diagnosis and can last for more than 10 years after treatment has ended 30.These emotional issues have been identified as the most important variables that influence the quality of life and adaptation of breast cancer patients following primary treatment 22,31 .Considering that the association between these emotional symptoms and long-term survival rates in cancer patients has been established 32 , emotional symptoms must be monitored continuously and comprehensive approaches for mental health promotion, such as psychological support and counseling specifically designed for breast cancer survivors, must be implemented.
Furthermore, depression has a higher prevalence rate than other emotional symptoms 29 and is considered the most influential factor in impairing the return of patients to routine life 32 .In particular breast cancer patients who have completed primary treatment experience a decrease in attention and support from family and friends compared to that during the treatment period, which leads to a more severe level of depression in a psychologically and socially vulnerable state.Such high levels of depression may hinder the return of breast cancer patients to normal life, affecting their adaptation and transition as survivors, as well as increasing the risk of recurrence and mortality 33,34 .Therefore, efforts are required to detect depression promptly and deal with it.effectively.
The Support Vector Machine, XGBoost, and CatBoost models identified dealing with a partner, housing, and work/school among practical problems as the most influential factors, demonstrating superior predictive performance.The results indicated that breast cancer survivors with problems dealing with partners experienced higher distress.Close interaction with caregivers is a crucial source of emotional support for breast cancer patients during surgery and treatment, as well as during the post-treatment period when they return to their daily lives and adapt to various changes.This interaction plays a significant role in managing the physical and psychological issues caused by post-treatment symptoms, and it enhances the overall well-being and adjustment of the patients 35 .Previous studies have indicated a positive impact on the psychological and social adaptation of breast cancer patients when they experience a high level of intimacy with their partners 36,37 .Thus, we can conclude that strengthening the intimacy with a partner is an important intervention factor in alleviating distress.
In terms of housing, unstable housing situations of breast cancer survivors may arise from financial risks associated with cancer diagnosis and treatment, particularly among low-income individuals 38 .The housing issue, which has a significant impact on family economics, lowers the living standards of households, increases the risk of contracting diseases, exacerbates distress, and impairs treatment compliance 39 .
Another important influencing factor that was identified in this study is the distress associated with returning to work, which functions as a social and financial safety net 40 ."Return to work" serves as an indicator of improved self-esteem in breast cancer survivors and signifies their social reintegration from being patients to becoming survivors 41 .Furthermore, the increase in income by returning to work is an important aspect of cancer recovery, providing economic stability and a sense of security for breast cancer survivors 42 .However, many breast cancer survivors experience difficulties in the process of returning to work owing to various physical, psychological, and social issues caused by treatment 43 .After returning to work, cancer survivors face several challenges such as a decrease in social status or rank, unwanted job replacements, issues with employers and colleagues, and diminished physical abilities.These difficulties make it challenging for them to continue working and cause distress 40,43 .Thus, practical support measures must be established to assist breast cancer survivors in successfully returning to work and achieving economic stability 42 .
Finally, the machine learning models identified physical symptoms such as fatigue, sleep, and pain as key predictors of distress among breast cancer survivors.Fatigue, sleep disorders, and pain are frequently reported as a symptom cluster in breast cancer survivors and can persist for more than five years 44 .Moreover, the symptom cluster in breast cancer survivors exhibits various patterns depending on the severity of the symptoms, affecting physical and social functioning, thereby leading to distress and reducing quality of life 45 .Consequently, for breast cancer survivors to return to their normal lives successfully following completion of primary treatment, intervention is required to monitor and manage the symptoms experienced by these survivors effectively.
This study is significant as it assessed the level of distress in breast cancer survivors and identified factors that influence the widespread occurrence of distress using machine learning techniques.However, the findings of this study must be interpreted with caution because of the following limitations.First, the research findings cannot be generalized because this study was conducted only on breast cancer survivors registered at the Cancer Survivor Integration Support Center located in the Gyeonggi region of South Korea.Therefore, future research calls for an expanded nationwide investigation of research participants to assess the distress of breast cancer survivors in detail.Second, this study exhibited limitations in identifying the changes in distress experienced during the transition from breast cancer patients to breast cancer survivors and determining the factors that influence these changes.Therefore, a longitudinal study should be conducted to identify the patterns of distress changes and related factors in breast cancer survivors.

Conclusions
This study demonstrated that approximately 50% of breast cancer patients experience distress.Factors affecting the distress of breast cancer survivors, including emotional problems such as depression, practical problems such as dealing with a partner, housing, work/school, and physical problems such as fatigue, were identified using machine learning models.When applied to hospital information systems in the future, the developed and validated model for screening severe distress in breast cancer patients can provide evidence for identifying

Figure 1 .
Figure 1.Comparative analysis of distress prediction performance in breast cancer survivors using machine learning models based on AUC scores.

Figure 2 .
Figure 2. Top-10 features in order of importance as calculated using XGBoost, Support Vector Machine, and CatBoost.

Table 1 .
Demographic and treatment-related factors of patients in mild and moderate-severe distress groups.

Table 2 .
Problem lists in mild and moderate-severe distress groups.*Fisher's exact test.

Table 3 .
Comparison of predictive performance for distress in breast cancer survivors using machine learning models.