Global optimization of case-based reasoning for breast cytology diagnosis
Introduction
Case-based reasoning (CBR) is a problem-solving technique that resembles the decision-making process people use in many real-world situations. It shows significant promise for improving the effectiveness of complex and unstructured decision making. In theory, CBR cannot overfit because it uses specific knowledge of previously experienced problems rather than their generalized patterns. A CBR system also stays up to date because its case-base is updated in real time, which is an important feature for real-world applications. In addition, it can explain its solutions by presenting similar past cases. Consequently, it has been applied to various problem-solving areas including engineering, finance, marketing, and medical diagnosis. CBR is particularly appropriate for medical applications because its characteristics fit medical domains well. Medical knowledge is usually incomplete, so medical applications place more emphasis on real cases than applications in other domains do. Moreover, the explanation capability of CBR can be more important in medical domains because it serves as a useful information source for decision makers (i.e., medical doctors).
Despite its various advantages, CBR has been criticized because its prediction accuracy is usually much lower than that of other AI techniques, especially artificial neural networks (ANNs). Many studies have therefore tried to enhance the performance of CBR. The most frequently studied mechanisms are those that improve the case retrieval process: the selection of appropriate feature subsets (Cardie, 1993; Domingos, 1997; Siedlecki and Sklansky, 1989; Skalak, 1994), the selection of instance subsets (Chiu, 2002; Kelly and Davis, 1991; Liao et al., 2000; Shin and Han, 1999; Wettschereck et al., 1997), the determination of feature weights (Babu and Murty, 2001; Huang et al., 2002; Lipowezky, 1998; Sanchez et al., 1997; Skalak, 1993; Yan, 1993), and the choice of the number of neighbors to combine (Ahn et al., 2003; Lee and Park, 1999).
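To make these retrieval parameters concrete, the following is a minimal sketch of weighted nearest-neighbor retrieval with an optional instance subset. It is an illustration only, not the implementation used in any of the cited studies: the function name retrieve_and_vote, the weighted Euclidean distance, and the majority vote are all assumptions chosen for simplicity.

```python
import math
from collections import Counter


def retrieve_and_vote(query, cases, labels, weights, k, mask=None):
    """Classify `query` by majority vote among its k nearest stored cases.

    weights: one non-negative weight per feature (feature weighting;
             a weight of 0 drops the feature, i.e. feature selection).
    mask:    optional list of booleans, one per case (instance selection).
    k:       number of neighbors to combine.
    All names here are hypothetical and chosen for illustration.
    """
    if mask is None:
        mask = [True] * len(cases)
    scored = []
    for case, label, keep in zip(cases, labels, mask):
        if not keep:
            continue  # case excluded from the reference case-base
        # Weighted Euclidean distance (an assumed similarity measure).
        dist = math.sqrt(
            sum(w * (q - c) ** 2 for w, q, c in zip(weights, query, case))
        )
        scored.append((dist, label))
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


cases = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["benign", "benign", "malignant", "malignant"]
print(retrieve_and_vote([5, 4], cases, labels, weights=[1, 1], k=3))
```

Changing the weights, the mask, or k changes which cases are retrieved, which is why the studies above treat each of them as a tunable parameter.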
One state-of-the-art technique for CBR is the simultaneous optimization of these parameters. Most prior research optimized them independently, even though the parameters interact during case retrieval. A global optimization model that considers them simultaneously may therefore improve the prediction results.
This study proposes a novel hybrid approach that uses genetic algorithms (GAs) to optimize three parameters of CBR simultaneously: (1) the weights of the features, (2) the training instances, and (3) the number of neighbor cases to combine. To validate the usefulness of our model, we apply it to the real-world case of breast cytology diagnosis via digital image analysis, and review the results it produces.
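One way to picture the simultaneous search is as a single chromosome that carries all three parameter groups, scored by how accurately the resulting CBR classifies held-out cases. The sketch below is written under stated assumptions and is not the authors' encoding: the flat gene layout (feature weights, then one 0/1 gene per instance, then k), the weighted Euclidean distance, and the names decode and fitness are all hypothetical.

```python
import math
from collections import Counter


def decode(chromosome, n_features, n_instances):
    """Split a flat chromosome into (weights, instance mask, k).

    Assumed layout: [w_1..w_F, s_1..s_N, k], with w in [0, 1],
    s in {0, 1}, and k a positive integer.
    """
    weights = list(chromosome[:n_features])
    mask = [bool(g) for g in chromosome[n_features:n_features + n_instances]]
    k = max(1, int(chromosome[n_features + n_instances]))
    return weights, mask, k


def fitness(chromosome, ref_cases, ref_labels, test_cases, test_labels):
    """Hit ratio on the test cases for the CBR settings in `chromosome`."""
    weights, mask, k = decode(chromosome, len(ref_cases[0]), len(ref_cases))
    kept = [(c, y) for c, y, keep in zip(ref_cases, ref_labels, mask) if keep]
    if not kept:
        return 0.0  # an empty case-base cannot classify anything

    def distance(query, case):
        return math.sqrt(
            sum(w * (q - x) ** 2 for w, q, x in zip(weights, query, case))
        )

    hits = 0
    for query, truth in zip(test_cases, test_labels):
        ranked = sorted(kept, key=lambda cy: distance(query, cy[0]))
        votes = Counter(label for _, label in ranked[:k])
        hits += votes.most_common(1)[0][0] == truth
    return hits / len(test_cases)


# Toy data: feature 0 is informative, feature 1 is noise.
ref_cases = [[0.0, 1.0], [0.2, 0.0], [1.0, 0.1], [0.8, 0.9]]
ref_labels = [0, 0, 1, 1]
test_cases = [[0.1, 0.9], [0.9, 0.0]]
test_labels = [0, 1]
print(fitness([1.0, 0.0, 1, 1, 1, 1, 1], ref_cases, ref_labels,
              test_cases, test_labels))
```

A GA would evolve a population of such chromosomes with selection, crossover, and mutation, using fitness as the objective; because all three parameter groups share one chromosome, they are searched jointly rather than one at a time.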
The rest of the paper is organized as follows. Section 2 briefly reviews prior studies, and Section 3 proposes our research model, which uses the GA approach to optimize feature weights, relevant instances, and the number of neighbors to combine simultaneously. Section 4 explains the research design and experiments, and Section 5 describes the empirical results and their meanings. The final section presents the conclusions of the study.
Section snippets
Prior research
In this section, we first review the general concept of CBR. We then examine previous research on optimizing it, followed by recent studies on the simultaneous optimization of several parameters of CBR systems. Finally, we examine in detail the GA approach, the key method for simultaneous optimization.
Global optimization of feature weights, instance selection, and the number of neighbors that combine using genetic algorithms
This study proposes a novel CBR model whose feature weighting, instance selection, and k parameter of k-NN are optimized globally, in order to improve prediction accuracy of typical CBR systems. Our model employs GA to select a relevant instance subset and to optimize the weights of each feature and the number of neighbors that combine simultaneously using the reference and the test case-base. We call it GOCBR (Global Optimization of feature weights, instance selection, and the number of
Application data
In general, there are three available methods for diagnosing breast cancer: mammography, fine needle aspirate (FNA) with visual interpretation, and surgical biopsy. Among them, surgical biopsy is known to be the most accurate method; however, it is invasive, time consuming, and costly. Thus, diagnosis systems based on digital image analysis that allow an accurate diagnosis without the need for a surgical biopsy are considered a realistic alternative. FNA involves using a small gauge needle to
The results of GA-optimized CBRs
Table 2 shows the parameters finally selected for each model. GOCBR yields 25 optimal feature weights and 176 optimal training instances that maximize prediction performance on the test set. Because there are 343 training samples in total, GOCBR selects about 51.31% of the case-base as its optimal instance subset. As Table 2 shows, GOCBR selects fewer instances than FISCBR (51.60%) and FWISCBR (79.02%), but more than ISCBR (47.52%).
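The selection ratio reported for GOCBR follows directly from the two counts given above. The short check below reproduces it; the helper name selection_ratio is hypothetical, and only the GOCBR counts (176 of 343) come from the text.

```python
def selection_ratio(selected, total):
    """Percentage of the case-base retained, rounded to two decimals."""
    return round(100.0 * selected / total, 2)


# 176 selected instances out of 343 training samples (from the text).
print(selection_ratio(176, 343))  # 51.31
```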
Comparison of the prediction performances
Conclusions
We have proposed a new hybrid CBR model using GA, called GOCBR. Our proposed model optimizes feature weighting, instance selection, and the number of neighbors to combine simultaneously. By selecting optimal instances, it may reduce noisy or distorted cases that lead to erroneous predictions. Our model may also find appropriate nearest neighbors for CBR by applying optimal feature weights to the similarity calculation, which may enhance the prediction accuracy. In addition, it generates prediction
References (33)
- et al. Comparison of genetic algorithm based prototype selection schemes. Pattern Recognition (2001)
- Using decision trees to improve case-based learning. In Proceedings of the 10th International Conference on Machine Learning (1993)
- A case-based customer classification approach for direct marketing. Expert Systems with Applications (2002)
- et al. GA based CBR approach in Q&A system. Expert Systems with Applications (2004)
- et al. Prototype optimization for nearest-neighbor classification. Pattern Recognition (2002)
- et al. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognition Letters (1999)
- et al. A case-based reasoning system for identifying failure mechanisms. Engineering Applications of Artificial Intelligence (2000)
- Selection of the optimal prototype subset for 1-NN classification. Pattern Recognition Letters (1998)
- et al. Prototype selection for the nearest neighbour rule through proximity graphs. Pattern Recognition Letters (1997)
- et al. Case-based reasoning supported by genetic algorithms for corporate bond rating. Expert Systems with Applications (1999)
- A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters
- Prototype and feature selection by sampling and random mutation hill climbing algorithms. Proceedings of the 11th International Conference on Machine Learning
- Prototype optimization for nearest neighbor classifier using a two-layer perceptron. Pattern Recognition
- Case-based reasoning; foundational issues, methodological variations, and system approaches. AI Communications
- Hybrid genetic algorithms and case-based reasoning systems for customer classification. Expert Systems