Breast cancer recurrence prediction with ensemble methods and cost-sensitive learning

Abstract Breast cancer is one of the most common cancers in women all over the world. Thanks to improved medical treatments, most breast cancer patients achieve remission. However, patients then face the next challenge: the recurrence of breast cancer, which may cause more severe effects and even death. Predicting breast cancer recurrence is therefore crucial for reducing mortality. This paper proposes a prediction model for the recurrence of breast cancer based on clinical nominal and numeric features. In this study, our data consist of 1,061 patients from the Breast Cancer Registry of Shin Kong Wu Ho-Su Memorial Hospital between 2011 and 2016, in which 37 records are denoted as breast cancer recurrence. Each record has 85 features. Our approach consists of three stages. First, we perform data preprocessing and feature selection to consolidate the dataset. Among all features, six are identified for further processing in the following stages. Next, we apply resampling techniques to resolve the issue of class imbalance. Finally, we construct two classifiers, AdaBoost and cost-sensitive learning, to predict the risk of recurrence, and we evaluate performance with three-fold cross-validation. By applying the AdaBoost method, we achieve an accuracy of 0.973 and a sensitivity of 0.675. By combining AdaBoost with the cost-sensitive method, our model achieves a reasonable accuracy of 0.468 and a substantially high sensitivity of 0.947, which guarantees almost no false dismissals. Our model can be used as a supporting tool in setting and evaluating follow-up visits for early intervention and more advanced treatments to lower cancer mortality.


Introduction
Breast cancer is one of the most common invasive cancers nowadays. According to the World Health Organization (WHO) report in 2018, breast cancer is the most frequent cancer among women [1]. It impacts 2.1 million women each year and causes the most significant number of deaths among all types of cancers. In 2018, approximately 627,000 women, 15% of all cancer deaths among women, died from breast cancer [1]. Moreover, according to the American Cancer Society, from 2007 to 2016, the invasive female breast cancer incidence rate increased slightly by 0.3% per year. The female breast cancer death rate peaked at 33.2 (per 100,000) in 1989 and declined by 40% to 19.8 in 2017, which is still a high mortality rate [2]. In Taiwan, breast cancer ranked fourth in cancer mortality and remained the highest cancer incidence among women in 2014.
More and more studies indicate that screening methods, including mammography, ultrasound, and MRI, may reduce breast cancer mortality and also increase the survival rate of breast cancer [3,4].
The mortality of breast cancer can be reduced by 40% for those who take part in screening every 1-2 years [5,6]. Moreover, many patients diagnosed with breast cancer achieve remission because of earlier detection and improved treatment. According to a survey of breast cancer statistics in 2019 [7], the average 5-year survival rate is approximately 90%, and the average 10-year survival rate is 83%.
Although breast cancer can be put in remission by early detection and improved medical techniques, some patients suffer from breast cancer recurrence. Breast cancer recurrence is a fundamental clinical manifestation, and it is even the primary cause of breast cancer-related deaths [8]. In recent years, many researchers have tried to find particular patterns predicting breast cancer recurrence [9]. For instance, by characterizing the presence of breast cancers' receptors, including ER, PR, HER2, and TNBCs, each subtype will have a higher risk of recurrence than others during particular years or in a specific situation [10][11][12]. Furthermore, axillary lymph node metastases are related to breast cancer recurrence [13]. The chances of breast cancer recurrence can be reduced by intervening in the metastases at an early stage. However, identifying these patterns demands considerable cost and time.
As a result, we propose a noninvasive computational model to predict the risk of breast cancer recurrence. Like [14,15], we make use of patients' clinical and treatment information in the Breast Cancer Registry to build a prediction model and evaluate various approaches to achieve our goal. Compared with the patterns mentioned above, our model can be used in a clinical application after the treatment of the original breast cancer in a low-cost and time-saving setting.
In the medical field, Machine Learning (ML) approaches are emerging techniques to resolve medical issues. For instance, Chen et al. develop an early prediction method which makes use of three-year hospital data to effectively predict chronic disease outbreaks. In the study, Chen et al. utilize both structured data and unstructured data [16]. In another study [17], the author proposes a general disease forecasting approach using the symptoms of the patient. The study utilizes K-Nearest Neighbor and convolutional neural network to predict the disease. Moreover, some significant research studies implement ML algorithms to forecast the recurrence of breast cancer. For instance, the study [15] implements three ML algorithms, including artificial neural networks (ANN), decision tree (DT), and Support Vector Machine (SVM) for breast cancer prediction. The study utilizes the Iranian Center breast cancer data for the prediction. The dataset consists of 1,189 records with 22 predictor variables and also a single outcome variable. In the study, the SVM outperforms other techniques and scores the highest accuracy and minimum error rate. In the study [18], the authors apply the NLP and ML algorithms to obtain features of breast cancer and organize the dataset as a comprehensive database. The study collects data from the King Abdullah University Hospital (KAUH) in Jordan. The data consist of 1,475 patient records which hold 142 breast cancer cases. Subsequently, the authors build a model for predicting the recurrence of breast cancer for choosing proper treatment methods and therapy. The research indicates that the bagging classifier outperforms other classifiers and scores an accuracy of 0.923 and a sensitivity of 0.923 [18]. In the study [19], the authors identify the elements significantly associated with recurrent breast cancer and employ the ANN model to detect the recurrence within ten years after breast cancer surgery. 
A total of 1,140 patients' data are involved in that study. The model scores an accuracy of 0.988 and a sensitivity of 0.954. The research [20] utilizes the DT C5.0 algorithm to achieve early detection of recurrent breast cancer. A total of 5,471 independent records were secured from official statistics of the Ministry of Health and Medical Education and from breast cancer patients of the Iran Cancer Research Center. In the study, the authors employ features such as the LN (Lymph Node) involvement rate, the HER2 (Human Epidermal Growth Factor Receptor 2) value, and tumor size for prediction. The model achieves an accuracy of 0.819 and a sensitivity of 0.869.

Dataset
Our dataset has been taken from the Breast Cancer Registry of Shin Kong Wu Ho-Su Memorial Hospital between 2011 and 2016. This dataset consists of 1,061 patients and 85 clinical features, as shown in Appendix 1. Furthermore, merely 37 records, approximately 3.5%, have a recurrence; the dataset is thus extremely imbalanced.
Since some particular values represent unfilled fields or inapplicable values, we perform data cleaning to replace those values with missing values. We then preprocess the features of "smoking behavior," "betel nut chewing behavior," and "drinking behavior," converting complex nominal data into binary classes, where Class 1 indicates that the behavior is present and Class 0 indicates that it is not. We also transform the target feature "recurrence" from a date format into a 'YES' or 'NO' binary class. To be more specific, if there is a date value, we regard it as 'YES'; otherwise, 'NO'.
Moreover, another data mining technique has been used: data integration. It combines data from several features to provide a unified view of the data. We derive the feature Body Mass Index (BMI) by integrating height and weight. The formula is BMI = weight (kg) / height (m)^2.
According to the Breast Cancer Registry, there are seven different therapies (i.e., Surgery, RT, Chemotherapy, Hormone/Steroid Therapy, Immunotherapy, Hematologic Transplant and Endocrine Procedure, and Target Therapy), and each therapy could be received in the declaration facility or elsewhere. In order to observe the relationship between these therapies and recurrence, we first integrate the corresponding features to define seven features indicating whether the patient had received each therapy. In reference to Appendix 1, we integrate (23)-(32) and (37) into Surgery, (33)-(52) into RT, (53)-(55) into Chemotherapy, (56)-(58) into Hormone/Steroid Therapy, (59)-(61) into Immunotherapy, (62)-(63) into Hematologic Transplant and Endocrine Procedure, and (64)-(66) into Target Therapy. Note that we remove Hematologic Transplant and Endocrine Procedure since it is not available in the declaration facility or elsewhere. As a result, six kinds of therapies remain in this study, along with the corresponding user-defined features. Then, we perform data preprocessing: if the result turns out to be "YES," we assign a value of 1 to the field; otherwise, a value of 0.
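The integration steps above can be sketched in Python as follows. This is a minimal illustration only: the function names and the 'YES'/'NO' encoding of the component fields are assumptions for the example, not the registry's actual field layout.

```python
# Sketch of the data-integration step (hypothetical helper names).
def bmi(height_cm, weight_kg):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)

def received_therapy(*component_fields):
    """Collapse several registry fields into a single 0/1 therapy flag:
    1 if any component indicates the therapy was received, else 0."""
    return 1 if any(v == "YES" for v in component_fields) else 0

print(round(bmi(160, 55), 1))                # -> 21.5
print(received_therapy("NO", "YES", "NO"))   # -> 1
```

The same any-of-components pattern applies to each of the six remaining therapy features.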
Moreover, we transform the 14 date-related features into 12 duration features. To be more specific, we take the 11 date-related features as the "start date." It is noted that features such as Reason for No RT or Reason for No Surgery of Primary Site are not descriptive data; they have already been categorized as nominal data. Take Reason for No RT as an example: there are 8 classes defining this feature, denoted '0', '1', '2', '5', '6', '7', '8', and '9', and each class has its own definition. For instance, '1' indicates that RT is not the priority treatment for the current patient, and '5' denotes that the patient expired before receiving RT.
Furthermore, we introduce our approach and illustrate the process flow of the system architecture in Figure 1. Starting from the left-hand side, we perform data preprocessing, including the handling of missing values, data transformation, and data integration (detailed in this section), as well as feature selection (detailed in Section Feature Selection) on the dataset. After splitting the data into training and testing sets, we apply resampling techniques, SMOTE and under-sampling, to the training data to solve the problem of data imbalance (detailed in Section Resampling). We then apply two classification algorithms, AdaBoost and cost-sensitive learning, to build our model (detailed in Section Classification Algorithm). Finally, we employ k-fold cross-validation to evaluate the results with six metrics, including accuracy, sensitivity, precision, specificity, ROC area, and F-measure (detailed in Section Evaluation).

Feature selection
We apply a feature selection approach, comprising the Correlation-based Feature Selector and Best First Search, to reduce the computational overhead of the massive data.
The Correlation-based Feature Selector (CFS) is a filter algorithm that evaluates the worth of feature subsets that are highly correlated with the class while having low intercorrelation among themselves. The CFS is based on a correlation-based heuristic evaluation function, and a feature is accepted depending on how well it predicts the class in areas of the instance space not already predicted by other features.
Best First Search (BestFirst) is a search strategy that explores the space of feature subsets by greedy hill climbing augmented with a backtracking ability. BestFirst moves through the space by expanding the present feature subset; once the promise of the current path decreases, the search backtracks to a previous, more promising subset and proceeds from there.
When implemented as a preprocessing step for ML, this combination of feature selection methods, CfsSubsetEval and BestFirst, has been found to perform best [21].
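As a rough illustration of how CFS scores a candidate subset, its merit heuristic can be sketched as below. This is a simplified sketch that assumes the pairwise feature-class and feature-feature correlations have already been computed; the correlation values in the example are invented for illustration.

```python
import math

def cfs_merit(feat_class_corrs, feat_feat_corrs):
    """CFS heuristic merit of a k-feature subset:
    k * mean(feature-class corr) / sqrt(k + k*(k-1) * mean(feature-feature corr))."""
    k = len(feat_class_corrs)
    r_cf = sum(feat_class_corrs) / k
    r_ff = sum(feat_feat_corrs) / len(feat_feat_corrs) if feat_feat_corrs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# A subset whose features correlate with the class but not with each other
# scores higher than an equally predictive but redundant subset.
print(cfs_merit([0.6, 0.5], [0.1]) > cfs_merit([0.6, 0.5], [0.9]))  # -> True
```

BestFirst then searches over subsets, keeping the one with the highest merit.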

Resampling
As the breast cancer recurrence dataset is imbalanced, we apply resampling techniques to the training data to handle the disproportionate ratio of observations in each class and to enhance the class boundaries. In our experiments, we perform under-sampling and the Synthetic Minority Over-sampling Technique (SMOTE) [22] in several different proportions. Under-sampling removes some observations of the majority class, while SMOTE generates new, synthetic minority samples by interpolating between a sample and its nearest minority-class neighbors.
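A minimal sketch of the SMOTE idea (not the exact implementation used in our experiments) is shown below: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbors.

```python
import random
import math

def smote(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: for each synthetic sample, pick a minority
    point, pick one of its k nearest minority neighbours, and interpolate
    at a random position between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new_points = smote(minority, n_new=4)
print(len(new_points))  # -> 4
```

Because the synthetic points lie between existing minority samples, they enlarge the minority region without merely duplicating records, which is what distinguishes SMOTE from plain over-sampling.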

Classification algorithm
Various kinds of ML algorithms solve classification tasks. In the prestigious ML competition KDD Cup, ensemble methods placed first every year from 2005 to 2018 [23][24][25][26][27][28][29][30][31]. They also dominated other competitions, such as the Netflix Competition [32] and Kaggle [33]. The ensemble method improves performance by combining several base learners into one prediction model. This finding was also applied in our previous paper [4]. Moreover, the ensemble method has been shown to be robust in handling class imbalance [34][35][36][37][38], which also appears in this study's Breast Cancer Registry data.
Among ensemble learning algorithms, AdaBoost (Adaptive Boosting) [39], proposed by Freund and Schapire, is one of the most important. According to the study in [40], AdaBoost has a solid theoretical foundation, produces extremely accurate predictions with incredible simplicity, and has a wide range of successful applications. Furthermore, AdaBoost is robust, resistant to outliers and noisy data, and avoids overfitting problems, so it is also known as the best out-of-the-box classifier [41,42]. AdaBoost combines classifiers from weak learners trained on various re-weighted distributions into a strong classifier, thus drastically improving performance. Therefore, we choose AdaBoost as our classification algorithm to achieve better performance.
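The boosting loop can be illustrated with a minimal discrete AdaBoost over decision stumps. This toy sketch is for exposition only and is not the implementation used in our experiments; the dataset at the bottom is invented.

```python
import math

def stump_predict(x, feat, thresh, polarity):
    # A decision stump: label `polarity` if x[feat] > thresh, else the opposite.
    return polarity if x[feat] > thresh else -polarity

def adaboost_train(X, y, rounds=3):
    """Minimal discrete AdaBoost with decision stumps as weak learners.
    Each round fits the stump minimising the weighted error, then
    re-weights the samples so the next stump focuses on the mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None  # (error, feat, thresh, polarity)
        for feat in range(len(X[0])):
            for thresh in sorted({x[feat] for x in X}):
                for polarity in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_predict(xi, feat, thresh, polarity) != yi)
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, polarity)
        err, feat, thresh, polarity = best
        alpha = 0.5 * math.log((1 - err + 1e-10) / (err + 1e-10))
        ensemble.append((alpha, feat, thresh, polarity))
        # Up-weight misclassified samples, down-weight correct ones.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, feat, thresh, polarity))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * stump_predict(x, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy example: labels follow the first feature.
X = [(0.0, 1.0), (0.2, 0.5), (0.8, 0.4), (1.0, 0.9)]
y = [-1, -1, 1, 1]
model = adaboost_train(X, y)
print([adaboost_predict(model, x) for x in X])  # -> [-1, -1, 1, 1]
```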
The cost-sensitive method [43][44][45] is a type of learning in data mining which aims to obtain minimal-cost classification results on an imbalanced dataset. By re-weighting the cost matrix, the classifier favors the lower-cost decisions and avoids the high-cost misclassifications. In our experiments, we expect the model to make fewer erroneous predictions on the recurrence class, i.e., fewer false-negative cases. Since the consequences of misjudging a recurrence would be too expensive, a higher penalty is given to the weight of the false-negative case in order to achieve approximately 100% sensitivity.
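The effect of the penalty can be illustrated with the standard minimum-expected-cost decision rule: flag a case as recurrence whenever the expected cost of missing it exceeds the expected cost of a false alarm. The probabilities below are purely illustrative; only the penalty value of 130 comes from our experiments.

```python
def cost_sensitive_predict(p_recurrence, fn_cost, fp_cost=1.0):
    """Predict recurrence (1) whenever the expected cost of a false
    negative outweighs that of a false alarm:
    p * fn_cost >= (1 - p) * fp_cost."""
    return 1 if p_recurrence * fn_cost >= (1.0 - p_recurrence) * fp_cost else 0

# With equal costs, only confident cases are flagged; with a heavy
# false-negative penalty (e.g. 130), even a low predicted risk
# triggers a recurrence alert.
probs = [0.02, 0.10, 0.60]
print([cost_sensitive_predict(p, fn_cost=1) for p in probs])    # -> [0, 0, 1]
print([cost_sensitive_predict(p, fn_cost=130) for p in probs])  # -> [1, 1, 1]
```

This is why raising the penalty pushes sensitivity toward 100% at the expense of accuracy: the decision boundary moves so that more borderline cases are treated as recurrences.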

Evaluation
In the experiments, we employ k-fold cross-validation to evaluate the performance of the model. We first randomly divide the dataset into k equal-sized partitions. For each unique partition, we take it as the validation dataset for evaluating the model, and the remaining (k - 1) subsamples are taken as the training dataset. Afterwards, we average the results from the k rounds of cross-validation. In our work, we set k to 3. The first fold includes 342 no-recurrence and 12 recurrence records, the second fold contains 341 no-recurrence and 13 recurrence records, and the third fold consists of 341 no-recurrence and 12 recurrence records. Moreover, accuracy, sensitivity, precision, specificity, ROC area, and F-measure are reported to evaluate model performance; they are defined as follows, using the confusion matrix shown in Table 2.
• Accuracy measures the ratio of correct predictions over all evaluated cases: Accuracy = (TP + TN) / (TP + FN + FP + TN).
• Sensitivity measures the fraction of actual positive cases that are correctly predicted: Sensitivity = TP / (TP + FN).
• Precision measures the proportion of positive predictions that are actual positive cases: Precision = TP / (TP + FP).
• Specificity measures the fraction of actual negative cases that are correctly predicted: Specificity = TN / (TN + FP).
• ROC area stands for "Receiver Operating Characteristic Area," also known as "Area Under the ROC Curve" (AUC). It measures performance as a relative trade-off between Sensitivity and Specificity.
• F-measure is the harmonic mean of Sensitivity and Precision: F-measure = 2 × Precision × Sensitivity / (Precision + Sensitivity). The higher the F-measure, the better the predictive power of the model.
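Using the usual confusion-matrix conventions, the threshold-based metrics above can be computed as follows; the confusion matrix in the example is hypothetical and chosen only for illustration.

```python
def metrics(tp, fn, fp, tn):
    """Evaluation metrics computed from a confusion matrix."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical confusion matrix for illustration:
m = metrics(tp=9, fn=3, fp=20, tn=320)
print(round(m["sensitivity"], 2))  # -> 0.75
```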

Results
Our approach consists of three stages. In the first stage, we perform data preprocessing and feature selection, which have been detailed in Sections Dataset and Feature Selection, respectively. The statistics of the selected features of our dataset are shown in Table 3. Among all features, six are selected: Regional Lymph Nodes Positive, Duration of First Contact, Tumor Size, Cancer Status, Response to Neoadjuvant Therapy, and Clinical N. Two of the selected features are described as follows:
• Regional Lymph Nodes Positive records the total number of regional lymph nodes that the pathologist tested positive. It can be used to evaluate the quality of a pathology report, the extent of surgery, and the quality of treatment.
• Clinical N refers to whether there is regional lymph node metastasis and the scope of the metastasis. It is used for prognosis estimation, treatment planning, evaluation of new therapies, result analysis, follow-up planning, and evaluation of early detection results. There are 11 classes in this feature, including 'NX,' 'N0,' 'N1,' 'N2,' 'N2a,' 'N3,' 'N3a,' 'N3b,' 'N3c,' 'no suitable definition,' and 'N/A (missing value).'
One may wonder whether using only the six selected features degrades the performance of the prediction model. As a result, we provide Table 4 to support our methodology.
In reference to Table 4, we summarize the performance of the prediction model using all features versus the six selected features. In terms of accuracy, both models are almost identical. Moreover, the six-feature model achieves higher precision and a higher ROC area, but lower sensitivity.
The results are taken as input for the next stage. In the second stage, we apply different ratios of resampling, including under-sampling and SMOTE, and apply AdaBoost to construct the model. The second-stage results are shown in Table 5. When the ratio of recurrence to no-recurrence is three to one, the F-measure is 0.657, the highest among all experiments, and the accuracy and sensitivity are 0.973 and 0.675, respectively.
In the third stage, we combine the AdaBoost and cost-sensitive methods to build a model with high sensitivity and acceptable accuracy. The performance of the third stage is reported in Table 6. Our model achieves an accuracy of 0.468 and a sensitivity of 0.947.
In medical applications with imbalanced data, it is challenging to build a prediction model that has both high sensitivity and high precision; there is a trade-off between the two.
Therefore, in this study, we provide two alternatives: one achieving high precision and one achieving high sensitivity. First, we build a prediction model with high precision by using only the six selected features, as shown in Table 4. Then, we build a prediction model with resampling techniques, as shown in Table 5.
However, with respect to cancer recurrence prediction, the prediction model is expected to have high sensitivity with reasonable precision, since the cost of a false-negative misclassification might not be affordable. As a result, we build another prediction model with high sensitivity by using cost-sensitive learning methods, to guarantee almost no false dismissal of recurrence prediction, as shown in Table 6. When dealing with the class imbalance problem in a medical application, we may make use of a cost-sensitive learning algorithm by setting a cost matrix which encodes the penalty of misclassification. A cost-sensitive classification technique takes the unequal cost matrix into consideration during model construction and generates the model with the lowest cost. In this study, the penalty is the cost of committing a false-negative error.
The setting of a penalty in the cost-sensitive method is reasonable when applying prediction algorithms in medical applications, since a false-negative case would not be affordable. Recurrence cases are rare, but they cannot be missed in the context of prediction. In medical prediction, false-negative errors are the most costly. In this study, we make use of cost-sensitive methods to reduce these errors by extending the decision boundary toward the negative class, in order to achieve a high sensitivity.
In reference to Table 6, when the penalty is set to 130, the sensitivity is 0.947 and the ROC area is 0.907. That is, our proposed method guarantees almost "no false dismissals," although it may raise some false alarms.

Discussion
Some discussions based on the experimental results are given below. First of all, our study employs data preprocessing and data integration to obtain information for clinical examination. Moreover, we apply feature selection algorithms to determine the most essential features among all features in our Breast Cancer Registry dataset. As a result, the six features shown in Table 3 are chosen, among which the Duration of First Contact is also selected by the feature selection algorithm. The selected features conform to Dr. Chung-Ho Hsieh's clinical experience with the recurrence of breast cancer. In addition, the six selected features achieve almost the same accuracy as using all features.
The Duration of First Contact could be approximately interpreted as the "Disease-Free Interval" which is one of risk factors of cancer recurrence [46]. In addition, according to [10], the risk of breast cancer recurrence will reach a peak in the first two years and then decrease gradually. Meanwhile, the average Duration of First Contact with respect to the 'Recurrent' patients is 3.08 years in our dataset. The slight difference between the two reports will be further studied in our future work.
In the Introduction, we stated: "For instance, by characterizing the presence of breast cancers' receptors, including ER, PR, HER2 and TNBCs, each subtype will have a higher risk of recurrence than others during particular years or in a specific situation [10][11][12]." In our study, we first investigated the performance of the prediction model using all features. The features of ER, PR, HER2, and tumor size are also included in our breast cancer registry dataset. Then, by applying the feature selection procedure, the six features are chosen to achieve better performance (in terms of ROC area and precision) without sacrificing accuracy.
In addition, we performed experiments using only ER, PR, and HER2. Note that we do not have TNBCs in our breast cancer registry dataset. According to our experimental results, the accuracy of using ER, PR, and HER2 is not as good as that of using all features or the six selected features. In more detail, when applying the model using only ER, PR, and HER2, all instances are classified as negative cases; that is, the model has no predictive power.
The resampling techniques play a crucial role in building a model for imbalanced data. This study utilizes various approaches to tackle the imbalance of the dataset. Initially, we implemented SMOTE to reduce the imbalance. Ensemble methods are an alternative approach to handling an imbalanced dataset as well. Accordingly, to construct a strong model, we employed the AdaBoost ensemble method.
Applying the cost-sensitive method shows the trade-off between accuracy and sensitivity. In the beginning, when we set equal costs, the accuracy is 0.973 and the sensitivity is 0.675. If we slightly increase the penalty of the cost-sensitive algorithm, the accuracy drops to 0.811 and the sensitivity rises to 0.754. When the penalty is set to 130, our model has a sensitivity of 0.947. In the third stage, we thus meet our goal of building a prediction model with high sensitivity and reasonable accuracy, in order to assist early diagnosis, treatment choice, and the determination of follow-up visit frequency.
Recently, several approaches have been proposed for breast cancer recurrence prediction, as described in the Introduction. In reference to Table 7, we summarize the performance of breast cancer recurrence prediction methods. At first glance, it seems that our approach does not outperform the ANN [19]. However, we would like to point out that our dataset is highly imbalanced: the percentage of recurrence in our dataset is 3.5%, so the baseline of our dataset is considerably high. The "baseline" is calculated by dividing the number of records in the largest category by the total number of records in the dataset. From the perspective of performance over the baseline, our approach performs well on this highly imbalanced dataset.
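The baseline computation described above is simply the share of the largest class, i.e. the accuracy of a trivial classifier that always predicts the majority class:

```python
def baseline_accuracy(class_counts):
    """Baseline: fraction of the largest class, i.e. the accuracy of
    always predicting the majority class."""
    return max(class_counts) / sum(class_counts)

# Our dataset: 1,024 no-recurrence vs 37 recurrence records.
print(round(baseline_accuracy([1024, 37]), 3))  # -> 0.965
```

Any model on our dataset must therefore be judged against a 0.965 majority-class baseline, which is why raw accuracy alone is not a fair basis for comparison with methods evaluated on more balanced datasets.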

Conclusion
This paper proposes an ML approach to build a noninvasive computational model for predicting the risk of breast cancer recurrence using imbalanced data. As a result, our model can serve in a clinical application of early diagnosis, predicting the risk of recurrence after the treatment of the original breast cancer. Early prediction can help with early diagnosis and prevention of cancer recurrence. Based on our model, physicians can take the prediction results as a reference when deciding on treatment methods, providing extra support for better decision making.
We use patients' clinical data and solve the problem of data imbalance by employing resampling techniques, cost-sensitive learning, and ensemble methods. We construct two prediction models. The first model achieves high accuracy with reasonable sensitivity, while the second behaves oppositely. With our approach, the first model achieves an accuracy of 0.973 and a sensitivity of 0.675, and the second model guarantees almost "no false dismissals," which means the sensitivity approaches 100%; its accuracy and sensitivity are 0.468 and 0.947, respectively.
Funding information: This study is financially supported by the Ministry of Science and Technology, Taiwan, under Contract No. 108-2221-E-030-013-MY2. The funders did not take part in study design, data collection and analysis, decision to publish, or manuscript preparation.

Conflict of interest: The authors have no conflicts of interest to declare.
Data availability statement: Due to the nature of this research, the datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.