BREAST CANCER DIAGNOSIS USING WRAPPER-BASED FEATURE SELECTION AND ARTIFICIAL NEURAL NETWORK

Breast cancer is the most common type of cancer among women. Early diagnosis plays a significant role in reducing the fatality rate. The main objective of this study is to propose an efficient approach for classifying breast tumors as either benign or malignant based on digitized images of a fine needle aspirate (FNA) of a breast mass, as represented by the Wisconsin Breast Cancer Dataset. Two wrapper-based feature selection methods, namely sequential forward selection (SFS) and sequential backward selection (SBS), are used to identify the most discriminant features, which can improve classification performance. A feed-forward neural network (FFNN) is used as the classification algorithm. The hyper-parameters of the learning algorithm are optimized using grid search. After the optimal classification model is selected, the data are divided into a training set and a test set, and the performance is evaluated. The feature space is reduced from nine features to seven and six features using SFS and SBS, respectively. The highest classification accuracy, 99.03%, was obtained with the FFNN using the seven SFS-selected features, while the accuracy with the six SBS-selected features was 98.54%. The results indicate that the proposed approach is effective in reducing the feature space, leading to better accuracy and a more efficient classification model.


INTRODUCTION
Breast cancer is the most common cancer among women. It is also considered the second most common cancer worldwide (Dhungel, Carneiro & Bradley, 2015). Early detection and accurate diagnosis of breast cancer can contribute substantially to reducing its morbidity and mortality (Addeh, Demirel & Zarbakhsh, 2017; Moodley et al., 2018). A cost-effective computer-aided detection/diagnosis (CAD) technique can play a crucial part in reducing interpretation errors and providing an automated diagnosis of breast cancer. This can greatly assist physicians by providing a second opinion, which eases the process of making the final decision. In this work, an artificial neural network is used to build a classification model for breast cancer diagnosis based on the fine needle aspiration modality. The main purpose is to correctly recognize the sample type as either benign or malignant. The Wisconsin breast cancer dataset (WBCD) is used for training and testing the proposed model. The accuracy of a classifier depends heavily on the features used for classification, so we use wrapper feature selection to extract the most useful features for diagnosis. Our approach shows encouraging results and can be developed into a fully automated CAD system.

RELATED WORKS
Several researchers have studied the performance of various prediction algorithms in classifying breast cancer data. Senturk & Kara (2014) compared the performance of seven classification algorithms on the WBCD: discriminant analysis, artificial neural network (ANN), decision tree, logistic regression, support vector machine (SVM), Naïve Bayes (NB), and k-nearest neighbor (KNN). These algorithms were tested on all nine features provided in the dataset. The best performance was obtained using the SVM, with an accuracy of 96.5%. Another study (Barna & Khan, 2019) tested the performance of six classifiers, namely logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), KNN, NB and ANN. The researchers concluded that the ANN outperformed the other classifiers when the data were split into training and test sets with a ratio of 75:25. The reported accuracy of the ANN was 97.21%.
The work presented by Vijayalakshmi & Priyadarshini (2017) focused on using neural networks to classify breast cancer tumors. The effectiveness of two neural networks was compared, namely the radial basis function (RBF) network and the back-propagation neural network (BPN). The RBF neural network achieved higher accuracy than the BPN: 98.26% versus 90.42%. Khan et al. (2019) proposed a novel embedded approach based on accelerated particle swarm optimization and cuckoo search, named HACPSO. Along with other datasets, they tested their approach on the Wisconsin Breast Cancer dataset and achieved 98.08% classification accuracy, which they reported as outperforming the existing methods. Mushtaq et al. (2019) compared five classification algorithms combined with principal component analysis (PCA) to classify breast cancer tumors. The highest accuracy, 99.20%, was obtained with a sigmoid-based NB classifier.
Kumar (2021) studied the performance of seven classifiers (logistic regression, KNN, decision tree, SVM, Naïve Bayes, random forest and ANN) for breast cancer diagnosis and reported that KNN was the best classifier, with 97% accuracy, and that the decision tree was the worst performer, with 94% accuracy. Another study (Islam et al., 2020) reported that ANN was the best classifier in terms of accuracy (98.57%) and precision (97.82%) compared to SVM, KNN, logistic regression and random forest when tested on the WBCD.
In the above-mentioned studies, the authors focused on comparing the performance of various classification algorithms without attempting to find the best feature subset. All features were assumed to be equally important for classification; hence, no feature selection was performed. In medical applications, especially in cancer classification problems, it is important to find the best feature subset and to understand the dependencies between features. This helps domain experts understand the most informative tumor characteristics and the relationships between them.
On the other hand, some studies used feature selection methods to find the attributes that produce the best discrimination results. A rule-based classification system with PCA was proposed by Douangnoulack & Boonjing (2018). Only 7 features were used for classification, and the J48 classifier achieved an accuracy of 97.36% on the WBCD. Kumari & Singh (2018) proposed a system for the prediction of breast cancer that combines feature selection using correlation-based measures with classification using several algorithms, including linear regression (LR), SVM, and KNN. Validation was performed using 10-fold cross validation, and the accuracy of the model was 99.2%. Nevertheless, the optimal feature subset was not reported.
Ed-Daoudy & Maalmi (2020) applied association rules (AR) to eliminate irrelevant features in the WBCD. Four out of nine features were selected, and several classification algorithms were used. The support vector machine with threefold cross validation produced the highest classification accuracy (98.00%).

Wisconsin Breast Cancer Dataset
The Wisconsin breast cancer database (WBCD) was created by Dr. Wolberg from the University of Wisconsin Hospitals and was donated and made publicly available online by Mangasarian (WBCD, 1995). The data were collected over a period of two years, from 1989 to 1991, and several researchers use it as a standard dataset for classification and other machine learning purposes. It represents observations of breast mass cell nuclei obtained by the fine-needle aspiration modality. The cytological samples were converted into digital images in order to extract the characteristics of the cell nuclei using image processing techniques.
Fine needle aspiration (FNA) is a type of biopsy procedure and one of the modalities used in the diagnosis of breast masses. In this procedure, a small needle (21 to 25 gauge) is used to acquire a sample of tissue and fluid from the breast (Casaubon, Tomlinson-Hansen & Regan, 2020). A total of 699 samples are available in the WBCD, each represented by nine nuclei features. The diagnosis of each sample as benign or malignant is also provided in the dataset: 458 of the 699 observations are flagged as benign, and 241 are flagged as malignant. The names and details of the feature set are described in Tab. 1. Each feature has a grade between 1 and 10, where 1 indicates the most normal condition and 10 indicates the most abnormal condition.
There are 16 observations with missing values, all of which relate to the Bare Nuclei feature. 15 samples are from the benign class, and one sample is from the malignant class. Since the number of missing values is small, and in the interest of maintaining data consistency, these 16 samples were removed.
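The removal step can be sketched as follows. This is a minimal illustration, not the authors' RapidMiner workflow: it assumes that a missing Bare Nuclei value is marked with `?` in the raw file, and the column name `bare_nuclei`, the helper `drop_incomplete` and the three sample rows are hypothetical.

```python
# Assumed marker for a missing value in the raw data file.
MISSING = "?"


def drop_incomplete(rows, column="bare_nuclei", missing=MISSING):
    """Return the rows whose `column` value is present, plus the number dropped."""
    kept = [row for row in rows if row[column] != missing]
    return kept, len(rows) - len(kept)


# Hypothetical rows: only the fields needed for the illustration.
rows = [
    {"id": 1, "bare_nuclei": "1", "label": "benign"},
    {"id": 2, "bare_nuclei": "?", "label": "benign"},
    {"id": 3, "bare_nuclei": "10", "label": "malignant"},
]

complete_rows, n_dropped = drop_incomplete(rows)
print(len(complete_rows), n_dropped)
```

Applied to the full dataset, removing the 16 incomplete samples leaves 699 - 16 = 683 samples for the experiments.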

Feature Selection
Feature selection is a commonly used data preprocessing procedure in data classification. It is mainly used for eliminating irrelevant and redundant attributes from a dataset (Tang, Alelyani & Liu, 2014; Foithong, Srinil & Pinngern, 2017). Additionally, it plays a significant role in enhancing data comprehensibility and visualization, reducing the time needed to train a classification model, and improving the prediction results (Jain & Singh, 2018).
Feature identification techniques have numerous applications in the healthcare sector. Filter methods, wrapper methods, ensemble methods and embedded methods are among the most widely used techniques for variable selection (Kohavi & John, 1997; Guyon et al., 2008).
In this paper, two wrapper feature selection methods are used, namely sequential forward selection (SFS) and sequential backward selection (SBS). Wrapper methods have been reported to outperform other existing approaches such as filter methods: they find the most "useful" features and select the features that are optimal for the learning algorithm (Kumar & Minz, 2014). Furthermore, wrapper methods give more accurate results because they consider feature dependencies (Ang et al., 2015). It has been stated that the Naïve-Bayes learning algorithm is robust when used to remove noisy features, because its performance degrades very slowly as more irrelevant features are added (Kohavi & John, 1997). For that reason, the Naïve-Bayes learning algorithm is used as the induction algorithm with both SFS and SBS.
In this research, the goal of the feature selection process is not limited to obtaining the highest classification accuracy. It also aims to detect the most clinically significant features, since this optimal set of features can help the specialist focus objectively on them during a routine manual diagnosis.
Both SFS and SBS are iterative methods. SFS starts with an empty set. In each iteration, every feature not yet selected is added in turn to the current subset, and the performance of each candidate subset is evaluated using the induction algorithm. Only the feature producing the highest increase in performance is added to the new feature subset, and a new iteration then starts from that subset. SBS, on the other hand, starts with the full feature set and removes one feature at each iteration. In both methods, the search stops when the induction algorithm detects no further improvement. Fig. 1 illustrates the feature selection process.
According to Fig. 1, the feature selection process for both SFS and SBS starts by generating a feature subset. The performance of the subset is evaluated with Naïve-Bayes using 10-fold cross validation. For each subset, if the performance of the induction algorithm increases, the final optimal feature subset is updated. The process continues to evaluate features until no further improvement in performance is detected.
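The two search procedures can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the RapidMiner operators used in the study: `score` stands for any function that returns the performance of a feature subset (in the paper, the 10-fold cross-validated Naïve-Bayes accuracy), and `toy_score`, its weights and the feature names `a` to `d` are hypothetical.

```python
def sequential_forward_selection(features, score):
    """Greedy SFS: grow the subset one feature at a time while the score improves."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every subset obtained by adding one not-yet-selected feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        cand_score, cand_feature = max(candidates, key=lambda c: c[0])
        if cand_score <= best_score:
            break  # no further improvement: stop the search
        selected.append(cand_feature)
        remaining.remove(cand_feature)
        best_score = cand_score
    return selected, best_score


def sequential_backward_selection(features, score):
    """Greedy SBS: start from the full set and drop one feature at a time."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing one selected feature.
        candidates = [(score([g for g in selected if g != f]), f) for f in selected]
        cand_score, cand_feature = max(candidates, key=lambda c: c[0])
        if cand_score <= best_score:
            break  # removing any feature no longer helps
        selected.remove(cand_feature)
        best_score = cand_score
    return selected, best_score


# Hypothetical scorer: each feature has a made-up usefulness, and every
# feature in the subset costs 3 points, so useless features lower the score.
USEFULNESS = {"a": 20, "b": 10, "c": 2, "d": 0}


def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset) - 3 * len(subset)


print(sequential_forward_selection(["a", "b", "c", "d"], toy_score))
print(sequential_backward_selection(["a", "b", "c", "d"], toy_score))
```

With this toy scorer both searches end on the same subset, but in general SFS and SBS can stop at different subsets, as they do in this study (seven versus six features).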

Classification
In this research, an ANN is used to classify the breast cancer samples in the WBCD as either benign or malignant. ANNs have been used extensively in the diagnosis and classification of many medical conditions, such as leukemia (Wahhab, 2015), prostate cancer (Wu, Zhuang & Tan, 2020), lung cancer (Hsu et al., 2020), liver cancer (Patsadu, Tangchitwilaikun & Lowsuwankul, 2021) and many others. There are various ANN architectures (Agrawal & Agrawal, 2015); one of the most widely used is the multilayer feed-forward neural network (FFNN) with a back-propagation learning algorithm (Zarei et al., 2020). In an FFNN, a number of parameters need to be tuned in order to obtain the best classification performance. These parameters include the number of hidden layers, the activation function, the number of neurons, the learning rate, and the number of epochs. The number of hidden layers is set to one, as a single hidden layer is usually sufficient for many kinds of classification problems (Guliyev & Ismailov, 2018). Many activation functions are available. Nevertheless, previous studies (Bonakdari et al., 2020; Shenouda, 2006) have reported that the sigmoid activation function produced better results in medical and non-medical applications than other activation functions. Hence, the sigmoid activation function is chosen for this experiment.
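To make the architecture concrete, the forward pass of a single-hidden-layer FFNN with sigmoid activations can be sketched as below. This is only an illustration: the weights are random and untrained, back-propagation is omitted, and the layer sizes (7 inputs, 6 hidden neurons), the helper names and the sample vector are assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_inputs, n_outputs, rng):
    """One dense layer: a weight vector and a bias per output neuron."""
    return [
        {
            "weights": [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)],
            "bias": rng.uniform(-0.5, 0.5),
        }
        for _ in range(n_outputs)
    ]


def layer_forward(layer, inputs):
    """Weighted sum plus bias for each neuron, followed by the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron["weights"], inputs)) + neuron["bias"])
        for neuron in layer
    ]


def ffnn_forward(hidden_layer, output_layer, inputs):
    """Input -> hidden layer -> single output neuron."""
    hidden = layer_forward(hidden_layer, inputs)
    return layer_forward(output_layer, hidden)[0]


rng = random.Random(0)
hidden_layer = init_layer(7, 6, rng)  # illustrative sizes, untrained weights
output_layer = init_layer(6, 1, rng)
sample = [5, 1, 1, 1, 1, 3, 1]        # hypothetical feature grades in 1..10
output = ffnn_forward(hidden_layer, output_layer, sample)
print(0.0 < output < 1.0)
```

Because the output neuron is also a sigmoid, the result lies in (0, 1) and can be thresholded to obtain a benign or malignant label once the network has been trained.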
The rest of the FFNN hyper-parameters are tuned by grid search using 10-fold cross validation. The grid search algorithm traverses every combination of the given parameter values. The parameters that result in the best performance are then used to train the final model, which is tested using the test set. In grid search, the performance of the model is verified using a statistical method called cross validation (CV) (Liu et al., 2017). Cross validation divides the dataset into two parts, namely training and validation. For each hyper-parameter combination, the FFNN is trained and its accuracy is verified. Eventually, the model that produces the highest performance is used for the final classification test. Tab. 2 presents the range of each hyper-parameter used in the optimization process.
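The search loop described above can be sketched as follows. This is a simplified illustration, not the RapidMiner implementation: the grid values, the helper names and `dummy_train_and_score` are hypothetical placeholders, and a real `train_and_score` would train the FFNN on the training indices and return its accuracy on the validation indices.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def grid_search(param_grid, n_samples, k, train_and_score):
    """Return (best_params, best_mean_score) over all parameter combinations."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Average the validation score over the k folds.
        fold_scores = [
            train_and_score(params, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid; the actual ranges used in the study are those of Tab. 2.
param_grid = {"hidden_neurons": [2, 4, 6], "learning_rate": [0.1, 0.01]}


def dummy_train_and_score(params, train_idx, val_idx):
    """Placeholder for 'train the FFNN, return validation accuracy'."""
    return -abs(params["hidden_neurons"] - 4) - abs(params["learning_rate"] - 0.1)


best_params, best_score = grid_search(
    param_grid, n_samples=683, k=10, train_and_score=dummy_train_and_score
)
print(best_params)
```

The dummy scorer is built to peak at an arbitrary combination so that the loop has something to find; it says nothing about which hyper-parameters are best for the WBCD.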

Tab. 2. Range of each hyper-parameter used in the grid search

Performance Measure
The performance of the proposed breast cancer classification model is evaluated using the confusion matrix, which gives the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Accuracy is the most common metric used to assess the effectiveness of a classifier. Other important metrics are precision and recall. Precision is the number of correct positive predictions divided by the number of samples classified as positive, while recall measures the ability of the classifier to identify the positive cases. These metrics are calculated as follows: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), and Recall = TP / (TP + FN).

RESULTS AND DISCUSSION
The proposed work is performed in three stages. First, the best features are selected using two wrapper methods, namely sequential forward selection and sequential backward selection. Second, the hyper-parameters are optimized using grid search. Finally, the FFNN is trained and tested with the optimal set of features. The feature selection and classification analyses were performed using RapidMiner. RapidMiner is an open-source data mining and machine learning tool that has been reported to provide the largest coverage of healthcare data mining requirements compared to other tools such as R, Scikit-learn and Spark (Santos-Pereira, Gruenwald & Bernardino, 2021).
From the data presented in Fig. 2, it can be seen that the two wrapper methods perform equally at feature number 3 and feature number 5. However, SFS reached its optimal performance at the 7th feature, whereas SBS reached its highest performance at the 6th feature. The best selected features for both methods are listed in Tab. 3.
Based on this result, the top 7 features selected using SFS were used to construct the FFNN model. During the grid search optimization, 18 FFNN architectures were trained and validated using 10-fold cross validation. Each architecture was constructed with a different set of parameters, as listed in Tab. 2.
As shown in Fig. 3, the best result was obtained with 6 neurons in the hidden layer, a learning rate of 0.01 and 100 epochs. After the optimal hyper-parameters were found, the original dataset was divided into a training set and a test set, with 70% used for training and 30% for testing. In order to compare the classification results obtained with the two feature selection methods, the classification experiment was performed twice. The first experiment used the 7 features selected by SFS, while the second experiment used the 6 features selected by SBS. The final classification results are presented in Tab. 4 and Tab. 5.
It is apparent from the data presented in Tab. 4 that the FFNN classification performance is almost the same with the SFS and SBS feature subsets, as all the benign instances were correctly classified in both experiments. Nevertheless, there is a slight difference in the classification of the malignant instances. The SFS subset resulted in 69 out of 71 malignant instances being classified correctly (2 classified as benign), whereas the SBS subset wrongly classified 3 malignant instances as benign. As illustrated in Tab. 5, the overall classification accuracy of the FFNN using the SFS feature subset is higher than that obtained using the SBS feature subset. In the first experiment, the accuracy was 99.03% with 100% precision and 97.18% recall, while in the second experiment the accuracy was 98.54% with 100% precision and 95.77% recall. Although the number of misclassified instances is small, in medical applications classifying an unhealthy case as healthy is more dangerous than classifying a healthy case as unhealthy, because in the latter situation the patient can undergo further investigation and the misclassification can then be ruled out.
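The reported figures can be checked with a short worked example using the standard definitions of accuracy, precision and recall, with malignant as the positive class. One value here is an assumption: the number of benign test instances (135) is not stated in the text and is inferred from the 71 malignant test instances and the reported accuracy. The function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Positive class = malignant. TN = 135 is an inferred value (assumption);
# FP = 0 because all benign instances were classified correctly.
sfs = classification_metrics(tp=69, tn=135, fp=0, fn=2)
sbs = classification_metrics(tp=68, tn=135, fp=0, fn=3)

for name, metrics in (("SFS", sfs), ("SBS", sbs)):
    print(name, {key: round(100 * value, 2) for key, value in metrics.items()})
```

Under that assumption, the counts give 99.03% accuracy and 97.18% recall for the SFS subset, and 98.54% accuracy and 95.77% recall for the SBS subset, with 100% precision in both cases, matching the values reported above.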
In our experiment, the misclassified instances could be due to the imbalanced class distribution of the WBCD, as the ratio of the majority class (benign) to the minority class (malignant) is approximately 2:1. The number of benign instances used in the training phase (309) is almost double the number of malignant instances (168). Although the majority-to-minority ratio is not very high, the possibility that the learning algorithm was overwhelmed by the majority class cannot be ruled out. Nevertheless, the proposed approach provides high accuracy, precision and recall in comparison with previous experimental results in the literature. Moreover, the proposed approach identified the optimal discriminative set of features using the SFS feature selection algorithm. It was found that two features, namely Single Epithelial Cell Size and Mitoses, do not contribute to the diagnosis of breast cancer using the fine-needle aspiration modality.
In the literature, several methods have been proposed for the diagnosis of breast cancer based on the WBCD. Tab. 6 compares the classification accuracy achieved by previous studies with that of our proposed approach. The proposed method outperformed many of the previous works. On the other hand, Kumari & Singh (2018) and Mushtaq et al. (2019) achieved slightly higher accuracies of 99.28% and 99.2%, respectively. However, neither work focused on finding the most clinically significant features, and the number of selected features was not reported. Most of the previous experiments used filter feature selection methods, which usually do not consider dependencies between features. In this research, SFS and SBS are used to obtain the optimal subset of features. The main advantage of these methods is that they take the interaction between features into consideration. One interesting finding is that a well-optimized FFNN architecture can produce better classification accuracy than some of the most popular machine learning algorithms, such as SVM and J48.
It is worth mentioning that in medical applications, especially in cancer classification problems, it is highly important to select the set of features that yields the optimal model performance. This helps domain experts understand the most informative tumor characteristics and the relationships between them.

CONCLUSION
Data mining and machine learning techniques are extensively used to explore patterns in medical data, which can be used for many purposes such as diagnosis and prognosis. Many studies have been conducted in the medical field to accurately diagnose diseases such as cancer. One of the most important steps in computer-aided diagnosis is feature reduction. Some features are non-informative or redundant, and such features make classification algorithms less effective. Hence, feature selection can considerably enhance the performance of the classification algorithm.
In this research, an approach to classify breast cancer based on the FNA modality has been proposed. Feature selection is a prominent process used for improving the overall classification accuracy as well as for understanding tumor characteristics. In this work, two wrapper feature selection methods, namely SFS and SBS, were used to extract the optimal subset of tumor attributes. Using SFS, seven important features were identified from the original nine-feature set. Afterward, the FFNN classifier was optimized, trained and tested on the WBCD. The performance of the proposed approach was evaluated and compared with previous works. The seven features selected using SFS produced the highest accuracy of 99.03%. This research demonstrated that wrapper feature selection methods such as SFS can be used to remove the less important features, and that the proposed FFNN model can be used to build efficient automatic diagnostic systems.

Fig. 3. The performance of the three different learning rates with six hidden neurons