Global optimization of case-based reasoning for breast cytology diagnosis

https://doi.org/10.1016/j.eswa.2007.10.023

Abstract

Case-based reasoning (CBR) is one of the most popular prediction techniques in medical domains because it is easy to apply, has no possibility of overfitting, and provides a good explanation for its output. However, it has a critical limitation: its prediction performance is generally lower than that of other AI techniques such as artificial neural networks (ANN). To obtain accurate results from CBR, effective retrieval and matching of useful prior cases is essential, but how to design a good matching and retrieval mechanism for CBR systems remains a controversial issue. In this study, we propose a novel approach to enhance the prediction performance of CBR: the simultaneous optimization of feature weights, instance selection, and the number of neighbors to combine, using genetic algorithms (GA). Our model improves the prediction performance in three ways: (1) measuring similarity between cases more accurately by considering the relative importance of each feature, (2) eliminating useless or erroneous reference cases, and (3) combining several similar cases that represent significant patterns. To validate the usefulness of our model, this study applied it to a real-world case: evaluating cytological features derived directly from digital scans of breast fine needle aspirate (FNA) slides. Experimental results showed that the prediction accuracy of conventional CBR may be improved significantly by using our model. We also found that our proposed model outperformed all the other GA-optimized CBR models.

Introduction

Case-based reasoning (CBR) is a problem-solving technique that resembles the decision-making process human beings use in many real-world applications. It often shows significant promise for improving the effectiveness of complex and unstructured decision making. In theory, there is no possibility of overfitting in CBR because it uses specific knowledge of previously experienced problems rather than their generalized patterns. A CBR system stays up to date because its case-base is updated in real time, which is a very important feature for real-world applications. It can also explain why it proposes a solution by presenting similar old cases. Consequently, it has been applied to various problem-solving areas including engineering, finance, marketing, and medical diagnosis. CBR is particularly appropriate for medical applications because its characteristics fit medical domains very well. Medical knowledge is usually incomplete, so medical applications put more stress on real cases than applications in other domains do. In addition, the explanation capability of CBR can be more important in medical domains because it can serve as a helpful information source for decision makers (i.e. medical doctors).

Despite its various advantages, CBR has been criticized because its prediction accuracy is usually much lower than that of other AI techniques, especially artificial neural networks (ANNs). Thus, there have been many studies aimed at enhancing the performance of CBR. Among them, the most frequently studied are mechanisms that enhance the case retrieval process: the selection of appropriate feature subsets (Cardie, 1993; Domingos, 1997; Siedlecki and Sklansky, 1989; Skalak, 1994), the selection of instance subsets (Chiu, 2002; Kelly and Davis, 1991; Liao et al., 2000; Shin and Han, 1999; Wettschereck et al., 1997), the determination of feature weights (Babu and Murty, 2001; Huang et al., 2002; Lipowezky, 1998; Sanchez et al., 1997; Skalak, 1993; Yan, 1993), and the determination of the number of neighbors to combine (Ahn et al., 2003; Lee and Park, 1999).
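To make these retrieval levers concrete, the following is a minimal Python sketch of case retrieval with feature weights, an instance-selection mask, and a neighbor count k. It is an illustration under stated assumptions, not the implementation used in this study: the weighted Euclidean distance, the majority vote, the function names, and the toy case-base are all hypothetical.

```python
import math
from collections import Counter

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance (an assumed similarity measure)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def retrieve_and_vote(query, cases, labels, weights, keep, k):
    """Majority vote over the k nearest cases that survive instance selection."""
    candidates = [
        (weighted_distance(query, case, weights), label)
        for case, label, flag in zip(cases, labels, keep)
        if flag  # instance selection: skip cases whose flag is 0
    ]
    candidates.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in candidates[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy case-base: feature 0 is informative, feature 1 is noise,
# and the last case is deliberately mislabelled.
cases = [[1.0, 0.5], [1.2, 0.4], [3.0, 5.2], [3.1, 5.0], [1.3, 0.45]]
labels = ["benign", "benign", "malignant", "malignant", "malignant"]
query = [1.4, 5.1]
everything = [1, 1, 1, 1, 1]

# Equal weights let the noisy feature dominate the distance.
plain = retrieve_and_vote(query, cases, labels, [1.0, 1.0], everything, k=1)
# Zero weight on the noisy feature, but the mislabelled case is now nearest.
weighted = retrieve_and_vote(query, cases, labels, [1.0, 0.0], everything, k=1)
# Dropping the mislabelled case changes the retrieved neighbor.
selected = retrieve_and_vote(query, cases, labels, [1.0, 0.0], [1, 1, 1, 1, 0], k=1)
# Combining three neighbors outvotes the mislabelled case instead.
combined = retrieve_and_vote(query, cases, labels, [1.0, 0.0], everything, k=3)

print(plain, weighted, selected, combined)
```

In this toy setting each lever changes the prediction on its own, which is the intuition behind tuning them jointly rather than one at a time.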

One of the state-of-the-art techniques for CBR is the simultaneous optimization of these parameters. Most prior research, however, tried to optimize them independently. Because the parameters jointly determine which cases are retrieved, a global optimization model that considers them simultaneously may improve the prediction results.

This study proposes a novel hybrid approach that uses genetic algorithms (GAs) to optimize three parameters of CBR simultaneously: (1) the weights of the features, (2) the training instances, and (3) the number of neighbor cases to combine. To validate the usefulness of our model, we apply it to the real-world case of breast cytology diagnosis via digital image analysis, and review the results produced by our model.
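One way to picture the simultaneous search is a single chromosome that carries all three parameter groups, scored by the hit ratio of the resulting CBR. The Python sketch below is only an illustration under assumptions: the dictionary encoding, the candidate values in `K_CHOICES`, truncation selection with elitism, uniform crossover, the mutation rate, and the synthetic data are hypothetical choices and are not taken from this study.

```python
import math
import random
from collections import Counter

K_CHOICES = [1, 3, 5]  # assumed candidate values for the number of neighbors

def predict(query, cases, labels, weights, keep, k):
    """Weighted k-NN vote restricted to the selected reference cases."""
    scored = sorted(
        (math.sqrt(sum(w * (q - c) ** 2 for w, q, c in zip(weights, query, case))), y)
        for case, y, flag in zip(cases, labels, keep) if flag
    )
    if not scored:  # an empty case-base cannot predict anything
        return None
    return Counter(y for _, y in scored[:k]).most_common(1)[0][0]

def random_chromosome(rng, n_features, n_cases):
    """One chromosome = feature weights + instance-selection bits + k gene."""
    return {"weights": [rng.random() for _ in range(n_features)],
            "keep": [rng.randint(0, 1) for _ in range(n_cases)],
            "k": rng.choice(K_CHOICES)}

def fitness(ch, train_x, train_y, valid_x, valid_y):
    """Hit ratio on held-out cases using the decoded parameters."""
    hits = sum(predict(x, train_x, train_y, ch["weights"], ch["keep"], ch["k"]) == y
               for x, y in zip(valid_x, valid_y))
    return hits / len(valid_x)

def crossover(rng, a, b):
    """Uniform crossover applied gene by gene across all three segments."""
    def pick(x, y):
        return x if rng.random() < 0.5 else y
    return {"weights": [pick(x, y) for x, y in zip(a["weights"], b["weights"])],
            "keep": [pick(x, y) for x, y in zip(a["keep"], b["keep"])],
            "k": pick(a["k"], b["k"])}

def mutate(rng, ch, rate=0.1):
    """Resample or flip each gene with probability `rate`."""
    return {"weights": [rng.random() if rng.random() < rate else w for w in ch["weights"]],
            "keep": [1 - b if rng.random() < rate else b for b in ch["keep"]],
            "k": rng.choice(K_CHOICES) if rng.random() < rate else ch["k"]}

def run_ga(train_x, train_y, valid_x, valid_y, pop_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    def score(ch):
        return fitness(ch, train_x, train_y, valid_x, valid_y)
    pop = [random_chromosome(rng, len(train_x[0]), len(train_x)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        elite = pop[: pop_size // 2]  # the better half survives unchanged
        children = [mutate(rng, crossover(rng, rng.choice(elite), rng.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    best = max(pop, key=score)
    return best, score(best), history

def toy_data(rng, n):
    """Synthetic two-feature cases: feature 0 is informative, feature 1 is noise."""
    xs, ys = [], []
    for _ in range(n):
        label = rng.choice(["benign", "malignant"])
        centre = 1.0 if label == "benign" else 3.0
        xs.append([rng.gauss(centre, 0.5), rng.uniform(0, 10)])
        ys.append(label)
    return xs, ys

data_rng = random.Random(42)
train_x, train_y = toy_data(data_rng, 40)
valid_x, valid_y = toy_data(data_rng, 20)
best, best_score, history = run_ga(train_x, train_y, valid_x, valid_y)
print(best["k"], sum(best["keep"]), round(best_score, 3))
```

Because the better half of each generation is carried over unchanged, the best hit ratio recorded in `history` never decreases. Encoding the three groups in one chromosome means a weight change is always evaluated together with the instance subset and k it will actually be used with.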

The rest of the paper is organized as follows. Section 2 briefly reviews prior studies, and Section 3 proposes our research model: the simultaneous optimization of feature weights, relevant instances, and the number of neighbors to combine using the GA approach. Section 4 presents the research design and experiments, and Section 5 describes the empirical results and their implications. The final section presents the conclusions of the study.

Prior research

In this section, we first review the general concept of CBR. We then examine previous research on optimizing it, and review recent studies on the simultaneous optimization of several parameters for CBR systems. Finally, we examine in detail the GA approach, which is the key method for simultaneous optimization.

Global optimization of feature weights, instance selection, and the number of neighbors to combine using genetic algorithms

This study proposes a novel CBR model whose feature weighting, instance selection, and k parameter of k-NN are optimized globally, in order to improve the prediction accuracy of typical CBR systems. Our model employs GA to simultaneously select a relevant instance subset and optimize the weight of each feature and the number of neighbors to combine, using the reference and the test case-base. We call it GOCBR (Global Optimization of feature weights, instance selection, and the number of

Application data

In general, there are three available methods for diagnosing breast cancer: mammography, fine needle aspirate (FNA) with visual interpretation, and surgical biopsy. Among them, surgical biopsy is known to be the most accurate method; however, it is invasive, time consuming, and costly. Thus, diagnosis systems based on digital image analysis that allow an accurate diagnosis without the need for a surgical biopsy are considered a realistic alternative. FNA involves using a small gauge needle to

The results of GA-optimized CBRs

Table 2 shows the final selected parameters of each model. GOCBR yields 25 optimal feature weights and 176 optimal training instances that maximize the prediction result for the test set. Because there are 343 training samples in total, GOCBR selects about 51.31% of the total case-base as its optimal instance subset. As Table 2 shows, GOCBR selects fewer instances than FISCBR (51.60%) and FWISCBR (79.02%), but more instances than ISCBR (47.52%).
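The reported selection rate can be checked with a line of arithmetic. The short Python sketch below uses only the two counts stated in the text (176 selected instances out of 343 training samples); the variable names are illustrative.

```python
selected, total = 176, 343  # counts reported above for GOCBR
share = 100 * selected / total
print(f"{share:.2f}%")
```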

Comparison of the prediction performances

Conclusions

We have proposed GOCBR, a new hybrid CBR model using GA. Our proposed model simultaneously optimizes feature weighting, instance selection, and the number of neighbors to combine. By selecting optimal instances, it may reduce the noisy or distorted cases that lead to erroneous predictions. Our model may also find appropriate nearest neighbors for CBR by applying optimal feature weights to the similarity calculation, which may enhance the prediction accuracy. In addition, it generates prediction

References (33)

  • W. Siedlecki et al. A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters (1989)
  • D.B. Skalak. Prototype and feature selection by sampling and random mutation hill climbing algorithms. Proceedings of the 11th international conference on machine learning (1994)
  • H. Yan. Prototype optimization for nearest neighbor classifier using a two-layer perceptron. Pattern Recognition (1993)
  • A. Aamodt et al. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications (1994)
  • Ahn, H., Kim, K.-J., & Han, I. (2003). Determining the optimal number of cases to combine in an effective case-based...
  • H. Ahn et al. Hybrid genetic algorithms and case-based reasoning systems for customer classification. Expert Systems (2006)