Breast cancer diagnosis using feature extraction and boosted C5.0 decision tree algorithm with penalty factor

Abstract: To overcome the two-class imbalance problem in breast cancer diagnosis, a hybrid method combining principal component analysis (PCA) and a boosted C5.0 decision tree algorithm with a penalty factor is proposed. PCA is used to reduce the dimension of the feature subset. The boosted C5.0 decision tree algorithm is utilized as an ensemble classifier for classification. The penalty factor is used to optimize the classification result. To demonstrate the efficiency of the proposed method, it is implemented on biased-representative breast cancer datasets from the University of California Irvine (UCI) machine learning repository. Given the experimental results and further analysis, our proposal is a promising method for breast cancer diagnosis and can be used as an alternative method in class imbalance learning. Indeed, we observe that the feature extraction process has helped improve diagnostic accuracy. We also demonstrate that the features extracted with breast cancer issues in mind are essential to high diagnostic accuracy.


Introduction
Breast cancer is one of the most deadly diseases for women around the globe [1]. The number of breast cancer cases was predicted to double to 1.6 million by the year 2025 [2]. To date, the cause of breast cancer remains unknown to doctors. Early diagnosis of breast cancer is the only way to ensure long survival of patients [3,4]: the earlier a tumor is detected before it spreads, the greater the hope of a cure. Accurate diagnosis of breast cancer has therefore become one of the important and urgent problems in the medical sciences.
Several machine learning techniques are commonly used to improve classification accuracy. Chen et al. [5] used a rough set-based SVM classifier and improved the accuracy to approximately 97% with different feature combinations, which provided a clue for physicians. Li et al. [6] proposed a novel supervised dimensionality reduction method that preserves the relationships within the data, and dimensionality reduction techniques [22,23], such as PCA and LDA, have become increasingly popular. Thus, in this paper, we combine feature extraction and the boosted C5.0 ensemble algorithm with a penalty factor to improve classification performance.
The proposed method is called PCA and boosted C5.0 with penalty factor (P-Boosted C5.0), where the name refers to the three stages of this study. First, PCA is used to transform the original feature subset into a new, smaller feature subset; to the best of our knowledge, PCA is a popular, simple, and well-established algorithm for feature extraction. Second, the boosted C5.0 algorithm is used as the classifier: boosted C5.0 is a general and practical ensemble approach that leverages the strengths of individual classifiers and offers an effective way to deal with the class imbalance problem. Third, a penalty factor matrix is employed to adjust the classification results, since it represents a balance between maximizing the classification margin and minimizing the classification error. The proposed algorithm is evaluated on well-known UCI breast cancer datasets, and the experimental results show its effectiveness and efficiency.
The main contributions of this paper can be summarized as follows: 1) PCA is used as the feature extraction algorithm to extract the optimal feature subset; 2) boosted C5.0 is used as the ensemble learning approach for classification to further improve performance; 3) a penalty factor matrix is used to adjust the result by assigning a high misclassification cost to the minority class; 4) empirical results on the WDBC dataset reveal the effectiveness of P-Boosted C5.0.
The remainder of this paper is structured as follows: Section 2 describes the proposed P-Boosted C5.0 algorithm. In Section 3, we present the experimental results and compare them with several other traditional algorithms. The discussion and conclusions are presented in Section 4.
Figure 1 shows the block diagram of the proposed P-Boosted C5.0 algorithm. The method consists of three stages. The first stage is the feature extraction step: PCA is applied to the dataset to extract an optimal feature subset, since a good feature subset leads to high classification accuracy. Specifically, to reduce the dimensionality of the features, the contribution of each estimated principal component is calculated, and components whose contribution to the total is less than 10% are eliminated to improve the accuracy of breast cancer prediction. The second stage performs the ensemble classification algorithm on this subset: the feature subset obtained in the first stage is given as input to a second-stage learning model. In the third stage, a cost-sensitive matrix is employed to adjust the classification result; specifically, cost-sensitive methods assign high cost weights to the minority class.

PCA for feature extraction
Feature extraction is a crucial factor for computational systems applied to diagnosis [24]: improving the feature extraction step can improve classification performance. We employ PCA in our study as a preprocessing step to enhance classification effectiveness. PCA is a popular unsupervised linear technique that attempts to transform the original feature set, which includes a large number of features, into a new, smaller feature space, so that the data can be expressed with a small number of feature variables. First of all, we use a normalization function, Eq (2.1), to rescale the feature values to a standard range between 0 and 1, since the features take values in different intervals, so that they can all be measured on a single standard scale.
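Eq (2.1) is not reproduced in this excerpt; a standard min-max rescaling of each feature is assumed, i.e.
$$x'_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}},$$
where $x_{ij}$ denotes the value of the $j$-th feature for the $i$-th sample and the minimum and maximum are taken over all samples of that feature.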
The details of PCA are shown in Algorithm 1. To represent the raw feature vectors with low-dimensional ones, we only need to compute the first k (k ≤ n) eigenvectors, which correspond to the k largest eigenvalues. To select the number k, a threshold θ is introduced that denotes the approximation precision of the k largest eigenvectors.
Given the precision parameter θ, the number k of eigenvectors can be determined.
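The selection rule itself is not shown in this excerpt; the usual cumulative-eigenvalue (explained-variance) criterion is assumed, i.e. k is the smallest number such that
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \theta,$$
where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ are the eigenvalues of the covariance matrix sorted in decreasing order.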
After the matrix V is determined, the low-dimensional feature vectors of the raw ones are obtained by projecting each raw feature vector onto the selected eigenvectors, i.e. y = Vᵀx. Thus, with PCA the maximum variance is explained by the first principal component, followed by the second, and so on: PCA orders the principal components so that those with the largest variation come first, and the features that contribute least to the variation are eliminated.
For our dataset, the scree plot of the principal components produced by the PCA feature extraction algorithm is shown in Figure 2. The red line in Figure 2 tends to become stable after the 9th principal component, which indicates that the first nine principal components capture most of the information in the original data. Therefore, we can transform the original feature data into a compact quantitative representation for training convenience. In practice, the first nine principal components are chosen as inputs to the second-stage classification algorithm. The benefit of feature extraction here is that the information in the original data is largely retained while an iterative search over feature combinations is avoided.
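As an illustration, a minimal R sketch of this PCA stage is given below; the object names (e.g., wdbc_features) are hypothetical and the scree-plot inspection mimics Figure 2.

```r
# PCA on the 30 WDBC tumour features, assumed to be stored in the
# data frame `wdbc_features` (illustrative name).
pca_fit <- prcomp(wdbc_features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- summary(pca_fit)$importance["Proportion of Variance", ]

# Scree-plot style inspection (analogue of Figure 2)
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance")

# Keep the first nine components as the new feature subset
reduced_features <- as.data.frame(pca_fit$x[, 1:9])
```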

Improved C5.0 decision tree for classification
The decision tree is one of the most fundamental and widely used classification methods in machine learning. C5.0 is an improved top-down decision tree algorithm derived from C4.5, and it builds the tree using the gain ratio, a modification of information gain, as its splitting criterion. The benefits of C5.0 are noticeably low error rates, lower memory usage, and higher optimization; as a result, the C5.0 algorithm is both more accurate and much faster. C5.0 has a tree-like structure, prunes the original decision tree, and creates the tree in a "divide and rule" manner. In addition, the most important improvement in C5.0 is the built-in boosting technique.
Boosting is a simple and effective ensemble learning method for producing accurate classifiers. The principle of the boosting algorithm is to repeatedly call weak learners and give these weak learners high vote weights. By doing so, the training process can focus more on the cases that caused errors, which tends to reduce bias. With respect to C5.0, its most critical feature is the boosting technique, and another is the support for a cost-sensitive matrix. As mentioned above, the boosting and cost-sensitive techniques can provide superior overall accuracy. Accordingly, in this work we propose a novel automated breast cancer diagnosis method that employs PCA for feature extraction, boosted C5.0 for classification, and a cost-sensitive matrix for adjusting the classification results. In our proposed approach, we consider not only the classification performance but also the unequal misclassification costs of tumors. In our experiment, the PCA feature extraction algorithm was employed to obtain the optimal feature subset that leads to the best classification performance; then boosted C5.0 was used as the classification algorithm, and lastly the cost matrix was used to adjust the classification results.
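A minimal R sketch of the boosted C5.0 stage is shown below, assuming the principal-component scores reduced_features and the class labels diagnosis from the previous stage (both names are illustrative); the trials value of 25 is the setting reported later in the paper.

```r
library(C50)

# Boosted C5.0: `trials` is the number of boosting iterations.
boosted_model <- C5.0(x = reduced_features,
                      y = diagnosis,
                      trials = 25)

summary(boosted_model)                            # inspect the boosted trees
pred <- predict(boosted_model, reduced_features)  # class predictions
```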

Decision making trade-off with cost sensitive matrix
To solve the imbalance problem, this paper adds a misclassification cost to the weight of each instance. To the best of our knowledge, the cost associated with missing a cancer case (a false negative) is much higher than that of mislabeling a benign one (a false positive). Specifically, a false positive may only incur the cost of an unnecessary biopsy for pathological analysis, whereas in a false negative case the patient may miss timely treatment, which can lead to death. In other words, cost-sensitive methods assign different cost weights to the majority and minority classes. This pushes the final classification boundary away from the minority class and thus enhances the classification accuracy, especially for the minority class. Consequently, we use a matrix of costs associated with the possible errors to adjust the classification results. In this paper, a cost matrix of size C × C, where C is the number of classes, is used; the cost-sensitive matrix is given in Table 1. A value of 4 in the matrix indicates that the cost of predicting a patient as healthy (a false negative) is four times the cost of predicting a healthy person as a patient (a false positive). This value is suggested by prior research.
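For illustration, the Table 1 cost matrix could be passed to C5.0 as follows; this is only a sketch, with hypothetical class labels "B" (benign) and "M" (malignant).

```r
library(C50)

# Misclassifying a malignant case as benign (false negative) is penalised
# four times as heavily as the opposite error (Table 1).
cost_matrix <- matrix(c(0, 4,
                        1, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(predicted = c("B", "M"),
                                      actual    = c("B", "M")))
# NOTE: the predicted/actual orientation of the `costs` matrix is an
# assumption here and should be checked against the C50 documentation.

cost_model <- C5.0(x = reduced_features, y = diagnosis,
                   trials = 25, costs = cost_matrix)
```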

Experimental results and discussion
To evaluate the performance of the hybrid approach on imbalanced datasets, we test the proposed algorithm on the WDBC and WBCD datasets. The experiments are performed in R version x64 3.2.5 on a PC with an Intel(R) Core(TM) i3-4130 CPU (3.40 GHz) and 4 GB of RAM, running the Windows 10 operating system. The P-Boosted C5.0 algorithm was implemented with the C50, caret, e1071, kernlab, ROCR, gplots and gmodels packages of R. Note that the packages were used with their default settings.
To test the effectiveness of the proposed P-Boosted C5.0 for breast cancer diagnosis, two standard breast cancer datasets are used. In addition, to assess the contribution and significance of the proposed algorithm, it is compared with results previously reported by earlier methods in the literature. Meanwhile, to evaluate the effectiveness of our proposed method, we compare the results of P-Boosted C5.0 with those of two well-known classifiers on the two standard breast cancer datasets. Furthermore, to make the observations more convincing, we conduct 10 independent runs of the experiment for each partition and report the average classification performance.
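A sketch of this evaluation loop in R is given below; it assumes the reduced_features, diagnosis and cost_matrix objects from the earlier sketches, a stratified 70/30 split per run, and G-mean as the reported metric (the object names are illustrative rather than taken from the paper's code).

```r
library(caret)
library(C50)

set.seed(2020)
g_means <- replicate(10, {
  idx   <- createDataPartition(diagnosis, p = 0.7, list = FALSE)
  model <- C5.0(x = reduced_features[idx, ], y = diagnosis[idx],
                trials = 25, costs = cost_matrix)
  pred  <- predict(model, reduced_features[-idx, ])
  cm    <- table(Predicted = pred, Actual = diagnosis[-idx])
  sens  <- cm["M", "M"] / sum(cm[, "M"])   # true positive rate (malignant)
  spec  <- cm["B", "B"] / sum(cm[, "B"])   # true negative rate (benign)
  sqrt(sens * spec)                        # G-mean for this run
})
mean(g_means)   # average G-mean over the 10 independent runs
```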

Dataset
We used the real-world Wisconsin Diagnostic Breast Cancer dataset (WDBC), taken from the UCI machine learning repository, in our experiment. This dataset is commonly used by researchers for breast cancer classification, which makes it easy to compare the performance of our method with that of methods in the literature. The WDBC dataset includes 569 observations and 32 attributes per patient: 30 tumor features, an ID, and one class label. The tumor features were computed from a digitized image of a fine needle aspirate. The ten main variables used to predict benign or malignant cases are 1) radius, 2) texture, 3) perimeter, 4) area, 5) smoothness, 6) compactness, 7) concavity, 8) concave points, 9) symmetry and 10) fractal dimension. 212 samples of the dataset belong to the malignant class and 357 to the benign class. The information for each dataset is summarized in Table 2.
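For reference, a possible way to load this dataset in R is sketched below; the local file name wdbc.data and the column handling are assumptions (the file is distributed by the UCI repository without a header row).

```r
# Load WDBC: column 1 is the ID, column 2 the diagnosis ("M"/"B"),
# columns 3-32 are the 30 tumour features.
wdbc <- read.csv("wdbc.data", header = FALSE, stringsAsFactors = FALSE)

diagnosis     <- factor(wdbc[[2]], levels = c("B", "M"))
wdbc_features <- wdbc[, 3:32]

table(diagnosis)   # expected: 357 benign (B), 212 malignant (M)
```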

Evaluation metrics
According to Raeder [25], evaluation measures play an important role in assessing classification performance. Classification methods usually adopt accuracy as the performance evaluation index, but in class-imbalanced scenarios overall accuracy is not a meaningful criterion, since the class of interest is often the minority class. In fact, for two-class imbalanced problems, classification success is typically measured by the geometric mean of the true positive and true negative rates [26], represented by the G-mean. Thus, in this paper the G-mean is adopted as the performance metric for evaluating the imbalanced learning classifier; it is a better indicator of the performance trade-off between classes than overall accuracy when the class distribution is imbalanced.
The evaluation indicators are computed from the binary confusion matrix presented in Table 3, where TP and TN denote instances correctly classified as malignant and benign, respectively, and FP and FN denote benign instances misclassified as malignant and malignant instances misclassified as benign, respectively. The calculation formulas are defined as follows:
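The formulas themselves are not reproduced in this excerpt; the standard definitions implied by Table 3 and the G-mean description above are assumed:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN},$$
$$\text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}.$$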

Experiment results
To perform a comprehensive comparison of the proposed algorithm in handling the breast cancer problem, we conducted experiments on the UCI dataset summarized in Table 2. Our experiment includes two parts: (i) the classification performance is compared, revealing the importance of extracting features, and (ii) we perform a comprehensive comparison of all algorithms on each dataset. Figure 2 reports the result of PCA for the WDBC dataset. In the figure, we observe that the nine-feature subset has essentially the same discernibility as the original set of features. Therefore, using a feature extraction algorithm is key to simplifying part of the data processing phase and improving performance by selecting significant features.
To evaluate the performance of the proposed ensemble approach, we compare the results of P-Boosted C5.0 with P-SVM, P-NB, RUSBoost [27] and SMOTE-Boosted C5.0. First, P-Boosted C5.0 is compared with P-SVM and P-NB to show the superior performance of boosted C5.0, since NB and SVM have been regarded as the most effective and common algorithms for breast cancer; second, the comparison between P-Boosted C5.0 and SMOTE-Boosted C5.0 shows the benefits of the PCA algorithm; third, P-Boosted C5.0 is compared with RUSBoost, a state-of-the-art approach for imbalanced data, to show the benefit of the proposed hybrid P-Boosted C5.0 algorithm. Here, SMOTE-Boosted C5.0 is a classical hybrid algorithm: the SMOTE sampling method is used to rebalance the class distribution, and the boosted C5.0 algorithm is used as the ensemble classifier. The over-sampling and under-sampling parameters of SMOTE are set to 100 and 300, respectively. Also, the trials parameter of the boosting algorithm is set to 25, an empirical value suggested in the literature.
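A sketch of the SMOTE-Boosted C5.0 baseline in R is shown below; it assumes the SMOTE implementation from the (now archived) DMwR package and a training data frame train_data whose Class column holds the diagnosis factor (both names are illustrative).

```r
library(DMwR)   # provides SMOTE(); package is archived on CRAN
library(C50)

# Rebalance the class distribution with the parameters stated above.
balanced <- SMOTE(Class ~ ., data = train_data,
                  perc.over = 100,    # over-sampling rate for the minority class
                  perc.under = 300)   # under-sampling rate for the majority class

# Boosted C5.0 on the rebalanced data.
smote_model <- C5.0(Class ~ ., data = balanced, trials = 25)
```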
Moreover, to obtain statistically meaningful conclusions, 10-fold cross-validation is repeated ten times, and the average results are reported for SMOTE-Boosted C5.0, RUSBoost, P-SVM and P-NB; among these algorithms, the best classification G-mean is highlighted in bold typeface. The confusion matrix of P-Boosted C5.0 is listed in Table 4. Table 5 reports the accuracy, specificity, sensitivity, and G-mean of P-Boosted C5.0 and the other classification methods on the WDBC dataset. As shown in Table 5, P-Boosted C5.0 outperforms the other methods in terms of G-mean when a 70-30 partition is used. As can be observed from the listed results, the proposed P-Boosted C5.0 obtains a G-mean of 95.89% with nine features, the best performance among all compared methods. Both the theoretical and experimental results show that the hybrid P-Boosted C5.0 is a promising system.
To further validate the performance of the proposed P-Boosted C5.0 algorithm, comparisons are also conducted with literature methods and several base classifiers, such as naïve Bayes (NB). Note that, for a fair comparison, NB is reported directly as a benchmark binary classification method without any prior feature extraction. To include more recent and advanced strategies in the comparison, we adopt methods such as IGSAGAW-CSSVM [28], RIPPER [29] and MaxE [30].
Finally, Table 6 shows the performance of the comparison methods mentioned above; the symbol "-" means the corresponding value was not available from the literature. From the results in Table 6, the proposed P-Boosted C5.0 obtains highly competitive performance among the classifiers reported in the literature [22-25]; the best G-mean, 97.28%, is achieved by the Aisl method. There may be two main reasons for this. First, feature selection is employed in the literature method, which can identify the significant features and eliminate the irrelevant ones to improve classification performance, whereas our P-Boosted C5.0 uses feature extraction, which transforms the original feature set into a new, smaller feature space; this transformation disturbs the original data distribution and, to some extent, introduces some noise. Second, the performance of a learning algorithm can be affected by different factors, such as feature space characteristics and parameters, and the value of the trials parameter in P-Boosted C5.0 is taken from the literature, which may not be the most appropriate for this specific problem. In addition, parameter settings and feature extraction play an essential role in the performance of breast cancer diagnosis.
As can be observed from the listed results, the classification model performs well for the diagnosis of breast cancer, and the performance is significantly affected by the feature extraction algorithm and the ensemble learning algorithm with a penalty factor. Deep learning methods have also shown promising results in cancer prediction [20,21,32], but they require more training time and more hyper-parameter tuning. According to the aforementioned analysis, P-Boosted C5.0 is a promising and effective approach for imbalanced datasets with a large number of features.
Table 6. Performance comparison (%).

Conclusion and future work
Biological data often contain redundant and irrelevant features, and breast cancer data are no exception. Since the tumor features are described in as much detail as possible, the redundant information leads to long computation times for tedious calculations without contributing significantly to the final results. Also, as the number of descriptive tumor features increases, the computational time increases rapidly as well. In this case, feature extraction, which removes irrelevant information and produces a new, smaller feature subset, has become a crucial preprocessing step for classification systems. Meanwhile, the issue of dealing with imbalanced datasets in breast cancer prediction remains unsolved.
To overcome the class imbalance problem in breast cancer classification while keeping an optimal new feature subset, the P-Boosted C5.0 algorithm is proposed. P-Boosted C5.0 is a three-step approach that first uses PCA for feature extraction to obtain the new optimal feature subset. Next, the boosted C5.0 algorithm with a fixed value of the trials parameter is applied for classification. Third, a cost-sensitive matrix is used as the penalty factor, whose value was determined according to the literature. Experiments were conducted on the WDBC dataset with 569 samples, and the experimental results demonstrated the advantages of the proposed P-Boosted C5.0 for solving the imbalance problem.
Future studies will involve setting the parameters according to the specific problem at hand. Also, a deep learning method could be applied to high-dimensional datasets, since deep learning methods usually achieve superior performance, yet they are not stable because of the impact of parameter settings. Thus, in future work we aim to create an adaptive method for setting parameter values in the deep learning model, where the values will depend on the minority class.