Survivability Prediction of Breast Cancer Patients Using Three Data Mining Methods: A Comparative Study

Background and aims: Breast cancer (BC) is a leading cause of cancer mortality among women. Early diagnosis is crucial for effective treatment. This study applied suitable data mining methods to derive classification rules and identify the prognostic factors that influence the survival time of BC patients. Methods: The dataset consisted of 1574 women diagnosed between January 2002 and December 2012 at the Cancer Registry Center of Nemazi hospital in Fars Province, Iran. Patients were classified based on prognostic factors using three popular data mining methods, including decision tree (J48), Naïve Bayes (NB), and nominal logistic regression (NLR). The Weka software was used to compare these methods in terms of sensitivity, specificity, and accuracy. The outcome of the study was survival time, which was categorized into three classes based on its median.


Introduction
Breast cancer (BC) ranks as the most prevalent cancer among women worldwide, standing as the fifth leading cause of death. In 2020, an estimated 684 996 women lost their lives to BC, with 2 261 419 newly diagnosed cases. 1,2 Moreover, there has been a decreasing trend in the number of women succumbing to BC since 2007, resulting in an annual 1% drop in the death rate from 2013 to 2018, particularly among women under the age of 50. 1 Several factors influence the risk of developing BC, including determinants of endogenous hormones, such as the early or late onset of menarche and menopause, the late age of first pregnancy, and genetic predisposition. The other influential factors are high intake of exogenous hormones, physical activity, a healthy diet, smoking, anthropometric characteristics, and family history. 3,4 Each female's recovery chance is associated with various variables, including tumor size, the number of involved lymph nodes, and other tumor characteristics, implying that estimating the survival chance for the patients might be difficult. 5 Although this cancer ranks as the second main cause of cancer death in women, its survival rate is notably high. 6 Therefore, determining accurate prognostic factors affecting a patient's survival is crucial.
Survival analysis, a statistical technique used when the response variable is the time to an event, plays a pivotal role in estimating recovery chances for patients. The Cox proportional hazard model is widely used 7; this regression model assumes linear associations between predictors and survival outcomes. Data mining methods offer an alternative by considering all potential interactions and effect modifications between variables. 8,9 Recently, data mining methods have been extensively applied in BC diagnosis and treatment. 10 They play a crucial role in reducing the frequency of false-positive and false-negative results, aiding physicians in decision-making, and assisting researchers in identifying disease patterns and predicting outcomes when dealing with numerous variables. 11 Recent studies analyzing medical survival data have shown that the decision tree (DT; C5) exhibits the highest accuracy, surpassing 90% when compared to artificial neural networks (ANNs) and logistic regression methods. 12 An overview of data mining techniques for BC predictions revealed that the C4.5 algorithm is the most accurate, achieving over 80% accuracy. 12 Another study focusing on machine learning algorithms for predicting BC survival rates concluded that the J48 DT model was more sensitive, logistic regression was more accurate, and ANNs had the highest specificity. 12 Recent research has demonstrated that the Naïve Bayes (NB) technique surpasses other classifiers in terms of accuracy. 13 Mohammed et al found that the J48 DT was more accurate, efficient, and effective in predicting BC risks based on evaluation criteria. 14 In a comparative study on data mining techniques, it was discovered that DTs and ANN methods could classify data with high accuracy. 15
Our study aimed to apply suitable data mining methods to survival data, providing rules and presenting influential prognostic factors on the survival time of BC patients. Notably, our study differs from previous ones as the target variable has three classes, whereas the majority of recent studies have focused on classifying patients into two categories.

Materials and Methods
A BC cohort study was conducted at the Nemazi hospital Cancer Registry Centre from January 2002 to December 2012. The inclusion criteria involved a BC diagnosis during the study period with no other type of cancer involvement. Ultimately, 1574 patients were included in the study. Patient medical information encompassed nipple involvement (NI), skin involvement (SI), lymphatic and vascular invasion (LVI; LV involvement), progesterone receptor, estrogen receptor, age, node total, nuclear grade, disease stage, tumor size, marital status, and education. The survival time for each case was computed as the difference between the time that each case entered the study and the time of death for cases that were followed until death, or from the baseline to the closing date of follow-up for living patients. 7 In this paper, to predict tumor nature for improved treatment in BC patients and determine influential prognostic factors on BC survivability rates, we focused on three classification algorithms in the Weka data mining tool, including J48 decision tree, NB, and nominal logistic regression (NLR), ultimately seeking the most accurate classifier. The target variable was categorized into three classes based on the median survival time for data mining purposes. Patients who survived more than the median survival time and remained alive until the end of the study were classified as the "Above-median" class. The "Below-median" class covered patients who survived less than 4 years with BC, while the "Undetermined" category consisted of BC patients who were alive and had less than 4 years in the study 16 (Algorithm 1).
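The three-class labelling can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function name, argument names, and class strings are assumptions. The prose describes above-median patients as alive, whereas Algorithm 1 lists them as dead; the sketch assigns every record with more than 4 years of follow-up to the above-median class, which is one reading that covers both descriptions.

```python
MEDIAN_THRESHOLD_YEARS = 4.0  # threshold stated in the text (median survival time)


def survival_class(time_years: float, died: bool,
                   threshold: float = MEDIAN_THRESHOLD_YEARS) -> str:
    """Assign one of the three target classes to a patient record."""
    if time_years > threshold:
        # Assumption: any record followed beyond the threshold is above-median,
        # whether or not the patient later died.
        return "above-median"
    if died:
        # Died within the threshold.
        return "below-median"
    # Alive, but follow-up is too short to know the final class.
    return "undetermined"


if __name__ == "__main__":
    for record in [(5.2, False), (2.0, True), (3.0, False)]:
        print(record, "->", survival_class(*record))
```

Keeping the threshold as a parameter makes it straightforward to repeat the labelling with a different cut-off, for example the observed median of another dataset.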

Data Mining Methods
The NB technique is based on the famous Bayes theorem, assuming strong (naïve) independence between features. In other words, knowing the value of one attribute reveals nothing about the value of another attribute.
The maximum likelihood function in the NB method has a closed-form expression, making it less expensive than other types of classifiers that use iterative approximation. NB, while not a Bayesian method, creates statistical predictive models based on Bayes' theorem. 17 One notable advantage of NB is its requirement for a small training sample for parameter estimation.
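The closed-form nature of NB can be illustrated with a short Python sketch for categorical attributes. It is not the Weka implementation: the function names, the Laplace smoothing constant `alpha`, and the toy records (an NPI group and an LVI status per patient) are all assumptions made for illustration. Training is a single counting pass, with no iterative optimization.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model by counting (no iteration needed)."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(j, y)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values, alpha


def predict_nb(model, row):
    """Return the class with the highest posterior under the independence assumption."""
    class_counts, feature_counts, values, alpha = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, cy in class_counts.items():
        score = math.log(cy / n)  # log prior
        for j, v in enumerate(row):
            # Smoothed conditional probability P(value | class).
            num = feature_counts[(j, y)][v] + alpha
            den = cy + alpha * len(values[j])
            score += math.log(num / den)
        if score > best_score:
            best, best_score = y, score
    return best


if __name__ == "__main__":
    # Hypothetical toy records: (NPI group, LVI status).
    rows = [("good", "free"), ("good", "free"), ("poor", "involved"),
            ("poor", "involved"), ("moderate", "free"), ("poor", "free")]
    labels = ["above", "above", "below", "below", "above", "below"]
    model = train_nb(rows, labels)
    print(predict_nb(model, ("good", "free")))
    print(predict_nb(model, ("poor", "involved")))
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow when many attributes are used.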
The DT (J48) is one of the most popular learning models for the powerful classification of observations. The tree models are called classification trees and regression trees when the outcome variable is categorical and continuous, respectively. The most important features of J48 include DT pruning, handling missing values, continuous ranges of attributes, rule derivation, and the like.
In this method, mathematical algorithms such as the Gini index, the chi-square test, and the like are used to allocate the input observations into subgroups. This process is repeated until the tree is completed. 12,18
NLR, developed by Joseph Berkson, is a generalization of linear regression. In this model, the log odds for the value of the outcome are a linear combination of predictor variables. NLR is utilized when the dependent variable has more than two categories. The benefits of the model include a strong statistical foundation, a probabilistic model for completely explaining the observations, high efficiency, interpretability, and the lack of requiring too many computational resources. 19
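To make the log-odds statement concrete, the following Python sketch converts linear scores into class probabilities for a three-class outcome. It assumes the common baseline-category parameterization, in which the last class serves as the reference with a score of zero; the coefficient values are invented for illustration and are not estimates from this study.

```python
import math


def class_probabilities(x, coefs, intercepts):
    """Class probabilities for a nominal logistic model with a reference class.

    coefs[k] and intercepts[k] give the log odds of class k versus the
    reference (last) class as a linear combination of the predictors x.
    """
    scores = [b + sum(w * xi for w, xi in zip(ws, x))
              for ws, b in zip(coefs, intercepts)]
    scores.append(0.0)   # the reference class has log odds 0 against itself
    m = max(scores)      # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    x = [1.0, 0.0]                      # hypothetical predictor values
    coefs = [[0.5, -1.0], [0.2, 0.3]]   # hypothetical coefficients
    intercepts = [0.1, -0.4]
    probs = class_probabilities(x, coefs, intercepts)
    print(probs, sum(probs))
```

With this parameterization, the logarithm of the ratio between the probability of any class and that of the reference class equals that class's linear score, which is what makes the coefficients interpretable.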

Evaluation Criteria (Performance Parameters)
The criteria for comparing the results of various data mining tools include sensitivity, specificity, and accuracy. The related definitions and formulas are provided as follows:

Algorithm 1. Setting the survivability dependent variable for a 4-year threshold (median of survival time).
if Time ≤ 4 years and alive then
    the record is pre-classified as "undetermined"
else if Time ≤ 4 years and dead then
    the record is pre-classified as "below median"
else if Time > 4 years and dead then
    the record is pre-classified as "above median"
else
    ignore the record
end if

Sensitivity (recall/true positive rate) is the ability of a classifier to identify the actual positive results.

Sensitivity = TP / (TP + FN)
Specificity (true negative rate) is the proportion of actual negatives correctly recognized by the classifier.

Specificity = TN / (TN + FP)
Accuracy refers to the ability of the classifiers to correctly predict the class label.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. 19 In this study, a 10-fold cross-validation procedure was employed to calculate the unbiased prediction accuracy of the applied prediction models. In this technique, the primary sample is randomly partitioned into 10 subsamples of equal size. One of these subsamples is used for testing, and the remaining are utilized for training. This process is repeated 10 times so that every subsample is considered the validation dataset. Finally, a single measure of accuracy for the model is computed by averaging the results from the folds. All classification techniques were run in the WEKA software.
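The cross-validation procedure can be sketched in a few lines of Python. This is an illustration of the general technique rather than the WEKA routine: the function names are assumptions, the split is a plain random partition without stratification by class, and `evaluate` stands for any caller-supplied function that trains on one index set and scores on the other.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average a score over k train/test splits.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that fits a
    model on train_idx and returns its score on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


if __name__ == "__main__":
    folds = k_fold_indices(1574)
    print([len(fold) for fold in folds])
```

When the sample size is not divisible by 10, as with 1574 records, the fold sizes differ by at most one.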

Prognostic Factors
This study investigated the effects of nipple and SI, LVI, progesterone and estrogen receptor (ER) status, age, total nodes, nuclear grade, disease stage, tumor size, education, marital status, and Nottingham prognostic index (NPI) on the status of a patient's survival. The patient's age was initially recorded as a continuous variable, but it was discretized into two classes (< 35 and > 35 years old). Marital status was classified as single or married. Education level groups were illiterate, primary school, high school, and university. The Tumour, Node, Metastasis (TNM) staging system categorized patients into different stages at diagnosis. Nodal status classes were 0, 1-3, and > 3 nodes involved. Tumor size was grouped into 3 categories (< 3, 3-5, and > 5 cm). The nuclear grade was classified into 3 levels (i.e., well-differentiated, poorly differentiated, and undifferentiated levels). Estrogen and progesterone receptor (PR) groups were positive or negative. Nipple, skin, and LV involvement were categorized as involved or free.
The NPI for each BC patient is the sum of the values of tumor size multiplied by 0.2, nuclear grade (1-3), and nodal status (1-3). The original NPI was grouped into 3 classes (good, moderate, and poor, with cut-off values of ≤ 3.4, 3.5-5.4, and > 5.4). 7
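The NPI calculation and grouping can be written out as a short Python sketch. The function names are assumptions, tumor size is assumed to be measured in centimetres, and nodal status is assumed to be already coded 1-3. Scores falling between the stated cut-offs of 3.4 and 3.5 are assigned to the moderate group here, which is also an assumption.

```python
def npi_score(tumor_size_cm, nuclear_grade, nodal_status):
    """NPI as described in the text: 0.2 * size + grade (1-3) + nodal status (1-3)."""
    return 0.2 * tumor_size_cm + nuclear_grade + nodal_status


def npi_group(score):
    """Map an NPI score to the good / moderate / poor groups."""
    if score <= 3.4:
        return "good"
    if score <= 5.4:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical patient: 3 cm tumor, grade 3, nodal status 2.
    score = npi_score(3.0, 3, 2)
    print(round(score, 1), npi_group(score))
```

For the hypothetical patient above, the score is 0.2 × 3 + 3 + 2 = 5.6, which exceeds 5.4 and therefore falls into the poor group.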

Variable Selection
To determine significant prognostic factors for improving classification accuracy, univariate Cox regression models were applied, and factors with P < 0.20 were selected for entry into the final model. Prognostic factors included age, tumor size, SI, ER status, PR status, NI, nodal status, nuclear grade, and LVI. According to the NPI formula, it is evident that nuclear grade, tumor size, and nodal status are correlated with NPI. Additionally, there is a strong correlation between disease stages and nodal status. Consequently, NI, SI, ER status, PR status, LVI, age, and NPI were selected for entry into the classification models. 6 The selected variables and their classes are listed in Table 1.

Results
By December 2012, 212 women (13.5%) had died due to BC. The mean age at diagnosis was 49.74 years. Overall survival rates at 2, 3, 5, and 10 years were 0.98, 0.94, 0.87, and 0.76, respectively. The mean and median survival times were 4.81 and 4.27 years. Sensitivity, specificity, and accuracy for J48 were 0.480, 0.570, and 0.572. In addition, the corresponding values for NB and NLR were 0.483, 0.610, and 0.584, as well as 0.488, 0.584, and 0.579, respectively.
In this study, because the target had three classes, the values of TP, FP, TN, and FN were computed separately for each class from the confusion matrix. The accuracy, sensitivity, and specificity of every data mining technique were then computed as weighted averages of the per-class values, with each class weighted by its size (Table 2). 20 Specificity and sensitivity can be computed by using confusion matrix information (TP, FP, TN, and FN). 21 The results of our approaches are reported in Table 3.
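The weighted one-versus-rest calculation can be sketched in Python as follows. The function name and the example confusion matrix are assumptions made purely for illustration; the numbers are not taken from Table 2. Rows are taken to be actual classes and columns predicted classes.

```python
def weighted_metrics(confusion):
    """Class-size-weighted sensitivity, specificity, and accuracy.

    confusion[i][j] is the number of records of actual class i that were
    predicted as class j. Each class is treated as "positive" in turn.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    result = {"sensitivity": 0.0, "specificity": 0.0, "accuracy": 0.0}
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        tn = total - tp - fn - fp
        weight = (tp + fn) / total  # class size as a share of all records
        if tp + fn:
            result["sensitivity"] += weight * tp / (tp + fn)
        if tn + fp:
            result["specificity"] += weight * tn / (tn + fp)
        result["accuracy"] += weight * (tp + tn) / total
    return result


if __name__ == "__main__":
    # Hypothetical 3-class confusion matrix.
    toy = [[50, 10, 5],
           [8, 30, 12],
           [4, 6, 25]]
    print(weighted_metrics(toy))
```

Under this weighting, the weighted sensitivity equals the overall share of correctly classified records, whereas the weighted accuracy also credits true negatives of each class and is therefore generally higher.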
Table 3 presents the evaluation criteria of the three classification algorithms used for the BC data set. The values show that the best classifier was NB due to the highest values of accuracy and specificity (58.4% and 61%). The second place was filled by NLR, and the third was J48. The J48 data mining method gives some additional information by building classification models in the form of tree structures. The most influential predictor is at the top of the DT. We concluded that the NPI was the most influential predictor in the survival status of BC patients. Other influential predictors were LVI, PR, ER, age, NI, and stage. The reported numbers in the DT in Figure 1 demonstrate the prediction process. For example, 3 (24/10) in the box implies that at this path, the prediction was level 3, and the (24/10) means that 24 observations in the dataset ended up at this path and 10 were incorrectly classified; in other words, 14 had the label 3 and 10 had the label 1 or 2. Some leaves in the DT show fractional counts. When an instance has a missing value for the attribute tested at a node, the classifier (J48) cannot determine which branch to follow, so it divides the instance among the branches in proportion to the share of instances following each branch.
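The leaf notation can be unpacked with a small Python helper. This is a sketch for reading labels of the form shown in the example; the function name and the returned field names are assumptions, as is the handling of a label with a single count (taken here to mean no misclassified records). Counts are parsed as floating-point numbers because leaves may carry fractional counts.

```python
import re


def parse_leaf(label):
    """Parse a leaf label such as '3 (24/10)' into counts and an error rate."""
    match = re.fullmatch(
        r"\s*(\S+)\s*\(\s*([\d.]+)\s*(?:/\s*([\d.]+))?\s*\)\s*", label)
    if match is None:
        raise ValueError(f"unrecognised leaf label: {label!r}")
    reached = float(match.group(2))
    # Assumption: a missing second number means no errors at this leaf.
    wrong = float(match.group(3)) if match.group(3) else 0.0
    return {
        "predicted": match.group(1),
        "reached": reached,
        "misclassified": wrong,
        "correct": reached - wrong,
        "error_rate": wrong / reached if reached else 0.0,
    }


if __name__ == "__main__":
    print(parse_leaf("3 (24/10)"))
```

For the leaf 3 (24/10), the helper reports 14 correctly classified records and an error rate of 10/24, matching the interpretation given in the text.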

Discussion
In this paper, the Nemazi hospital BC dataset was employed to evaluate the accuracy of three popular prediction models, including one from statistics (NLR) and two from machine learning (J48, NB). NB was the best classifier, while J48 was the least effective classification method.
Based on our results, it is confirmed that applying advanced data mining methods leads to high predictive accuracy. However, some issues exist regarding data collection, data mining methods, and predictive ability that should be taken into consideration. Two crucial aspects of predictive accuracy are data size and quality. 22 Medical data typically exhibit heterogeneity, making the application of data mining techniques challenging. Issues such as missing data, redundancy, imprecision, and inconsistency can influence the results of data mining methods. Additionally, data-gathering methods may introduce noise. 23,24 Although data mining methods have some drawbacks, such as not fulfilling classical statistical conditions, 25 they can potentially be essential tools in medicine, exploring aspects of diseases and providing valuable information for future research. 26,27 The acceptable and good predictive accuracy of our applied models is just one factor emphasizing the importance of data mining methods in the medical field.
Recent studies on the survival of BC patients have employed different analytical methods, including data mining methods and the Cox regression model, the traditional statistical model for analyzing survival data. The most common data mining techniques used in previous studies were J48, NB, and ANN. 30,31 Recent research has shown that NB was the most accurate method, 13 which is consistent with the results of our study.
The secondary objective of our study was to determine prognostic factors. Based on our results, using suitable predictive attributes led to the development of a model for accurately predicting outcomes. In medicine, these models can be applied in prediction, diagnosis, and treatment. 27,32 Important influential attributes were identified based on J48. The first and most important was NPI. It is a traditional and widely used method for predicting the survival of BC 33 and is also used to provide a basis for assessing newly designed methods for prognosis in BC, including microarray techniques. 34 Other important predictors were LVI, PR, ER, age, NI, and stage. Our findings are in line with those of previous studies. In a recent study on survival prediction using DTs and logistic regression analysis, LVI and stage were identified as influential factors. 35 Tanha et al investigated the relationship among prognostic indices of BC using classification techniques in 2020. In that study, PR, ER, and age were identified as significant prognostic factors. 36 NI was one of the significant independent prognostic discriminants in pathologic findings from the National Surgical Adjuvant Breast Project. 37 Enhancing accuracy and precision in prediction is feasible through various measures. These measures encompass modifying the size of variables, diminishing the number of features, or choosing the most dependable features through applying robust algorithms such as principal component analysis for feature selection. Moreover, modifying the techniques utilized for data preprocessing, adjusting runtime parameters, and employing ensemble methods with varying parameters can potentially enhance precision and accuracy rates. 38
Considering the large amount of data available in medical databases and the potential significant association between symptoms and diagnosis, applying data mining algorithms to explore these relationships is advantageous. However, data mining is not intended to replace medical professionals but rather to enhance their efforts in saving human lives. Medical researchers and statisticians must examine the availability of their biological data concerning variables associated with cancer survivability prediction. Variables in this study were selected using the literature on computational biology and the available BC dataset, along with the researcher's domain knowledge. Data quality has the potential to determine the outcome of a machine learning method, either leading to success or failure. Data preparation is therefore a crucial step, accounting for a significant portion, ranging from 60 to 80%, of the overall data mining or machine learning procedure. 39

Limitations of Data Mining Methods
Despite the potential of data mining to offer valuable insights and assistance to medical professionals through pattern identification, there are limitations to what it can do. Not all patterns discovered through data mining can be deemed "noteworthy"; a noteworthy pattern must possess logical reasoning and be actionable. For instance, while data mining can be useful in diagnosing or suggesting treatment, it is not a proper replacement for a physician's intuition and interpretive abilities.

Conclusion
In this paper, we attempted to improve the accuracy of BC classification by utilizing data mining methods. Comparing multiple prediction models for BC survivability gave us insight into the relative prediction abilities of different data mining methods. The results suggested NB as the best classifier due to its higher accuracy and specificity. Our experimental results revealed significant relationships between different prognostic indices in the BC dataset. J48 identified the NPI as the most effective prognostic factor. Looking ahead, there is still a need for a comprehensive investigation employing data mining methods to determine designs that yield a higher level of precision and accuracy. To make considerable strides forward in the prognosis and treatment of BC, continued investigation and collaboration between data scientists, medical experts, and researchers is crucial.

Table 3. Evaluation Criteria in Three Data Mining Methods
Figure 1. Decision Tree of Breast Cancer Patients for Medical Application