A combined data mining approach using rough set theory and case-based reasoning in medical datasets

Article history: Received October 15, 2013; received in revised format March 2, 2014; accepted April 17, 2014; available online April 23, 2014.

Case-based reasoning (CBR) is the process of solving new cases by retrieving the most relevant ones from an existing knowledge base. Because irrelevant or redundant features remarkably increase both the memory requirements and the time complexity of case retrieval, reducing the number of dimensions is an issue worth considering. This paper uses rough set theory (RST) to reduce the number of dimensions in a CBR classifier with the aim of increasing accuracy and efficiency. CBR exploits a distance based on the co-occurrence of categorical data to measure the similarity of cases. This distance is based on the proportional distribution of the different categorical values of features. The weight used for a feature is the average of the co-occurrence values of the features. The combination of RST and CBR has been applied to the real categorical datasets Wisconsin Breast Cancer, Lymphography, and Primary Tumor. The 5-fold cross-validation method is used to evaluate the performance of the proposed approach. The results show that this combined approach lowers computational costs and improves performance metrics, including accuracy and interpretability, compared to other approaches developed in the literature. © 2014 Growing Science Ltd. All rights reserved.


Introduction
Classification is a widely used technique in various fields, including data mining and knowledge discovery, which maps each item of the selected data onto one of a given set of classes. There are three categories of classification models: single classifiers, hybrid classifiers, and ensemble classifiers. Regression, discriminant analysis, artificial neural networks, support vector machines, decision trees, and case-based reasoning are some instances of single classifiers. Hybrid classifiers use several classifiers so that the integrated classifier offsets the disadvantages of the individual classifiers and improves classification accuracy; examples are the classifiers proposed by Zeinal Hamadani et al. (2013) and Khashei et al. (2013). Ensemble classifiers, such as those proposed by Reboiro-Jato et al. (2014) and De Stefano et al. (2014), combine multiple classification models as a council to make more appropriate decisions. The classifier proposed in this paper belongs to the second category.

Case-based reasoning (CBR), the classification system developed by Schank (1982), is able to automatically predict the class of new unclassified records. CBR solves a new problem by recalling previous similar cases and reusing the information and knowledge of those cases. CBR comprises four steps: (1) retrieving the most similar case(s), (2) reusing existing knowledge of previous cases to solve the new problem, (3) revising the proposed solution if necessary, and (4) retaining the new solution as a part of future problem solving (Aamodt & Plaza, 1994). Retrieval is the most important step in the CBR process. The most common method to retrieve cases is to apply the k-nearest neighbor rule, which uses a distance function to measure the similarity between the new case and the labeled case(s) (Watson, 1999).

CBR systems are sensitive to unreliable, noisy and inconsistent data. In the CBR literature, these problems have received attention from two areas of research: (1) instance selection, with the aim of reducing the number of redundant cases, and (2) feature selection, with the aim of identifying as many irrelevant features as possible (Salamó & López-Sánchez, 2011). Moreover, various studies on CBR systems tune their key parameters so that the classification outcome is improved. These works cover the various methods of measuring similarity or distance as well as determining k in the k-nearest neighbor method. In most situations, the Euclidean distance is used to measure the distance between two cases with numerical features and the Hamming distance to estimate the distance between two cases with categorical features. However, in some cases, the Euclidean distance is used for categorical features. These criteria are not suitable for measuring the distance between categorical features and do not provide a good evaluation of the degree of similarity. Therefore, developing a proper distance function can be a useful step in enhancing the efficiency and accuracy of CBR systems.

Rough Set Theory (RST) is one of the techniques for identifying and detecting common patterns in data, especially in the case of vague, uncertain and incomplete data (Pawlak, 1982). Exact and metaheuristic algorithms based on RST have been developed for feature selection (Yao et al., 2008; Yao & Zhao, 2009; Rezvan et al., 2014; Hedar et al., 2008; Chen et al., 2010). This theory has been used as a well-known feature selection technique in combination with learning algorithms. Yun et al. (2004) used it for feature selection to create a decision tree with a minimum number of leaves. Zhang and Yao (2004) also used RST as a feature selection method and extracted classification rules. Rao and Sarma (2003) developed a methodology based on rough-fuzzy sets and proposed a decision support tool based on this methodology for the retrieval of candidate components for software reuse. Huang and Tseng (2004) applied the rough set method to reduce the computational complexity of CBR. They proposed a semi-structured format for case representation using the Zachman framework and presented an efficient approach to reasoning over similar cases for decision-making. Li et al. (2006) presented a novel rough set-based case-based reasoner for text categorization. Jiang et al. (2006) presented a novel methodology that uses a fuzzy similarity-based rough set algorithm for feature weighting and reduction in CBR systems for tool selection in die and mold NC machining. Liu and Yu (2009) applied rough sets to remove non-correlated parameters, thereby simplifying the case library of CBR, and searched for the most similar cases using rough set rules. Lin et al. (2009) developed a hybrid failure prediction model that uses rough set theory and grey relational analysis as data preprocessors to strengthen the predictive capability of CBR. Louhi-Kultanen et al. (2009) applied rough sets and fuzzy sets in the adaptation phase of CBR for fluidized-bed crystallization. Their proposed approach may save time-consuming experimental work by predicting the crystal size distribution of a new compound. Salamó and López-Sánchez (2011) investigated dimensionality reduction based on rough sets for CBR on datasets from the UCI repository. Their main focus was to develop strategies for feature selection and to propose several measures for estimating attribute relevance based on RST. Du et al. (2012) presented a system based on CBR to solve geographic problems. This system uses a rough set-based algorithm to prune essential spatial relations and extract decision rules. In this system, the solution derived from past similar cases was not accepted for the new problem unless it also satisfied the decision rules. Lu et al. (2012) developed a hybrid approach based on RST and CBR that allows doctors to adapt patients' treatment processes to changes in the patients' clinical states. Wang et al. (2012) applied the soft fuzzy rough set for feature reduction in constructing a case-based classifier for highly cited papers. They concluded that features such as the research capabilities of the first author, the quality of the paper and the reputation of the journal are the most relevant predictors for highly cited papers. Chuang (2013) proposed three CBR-based hybrid models for business failure prediction: combining RST with CBR, integrating RST and grey relational analysis with CBR, and combining classification and regression trees with CBR.

This paper presents a combined approach which uses RST to select features in the data preprocessing stage and exploits CBR for classification. An exact algorithm is used for feature selection, along with a distance function based on the proportional distribution of the different categorical values of features in the CBR system. This approach is evaluated on categorical datasets from the UCI repository. Finally, the obtained results are compared with those of previous studies.

The remainder of this paper is organized as follows. Section 2 outlines preliminary notions of the rough set model. The CBR model, as a new classification method, is introduced in Section 3. In Section 4, comparative experiments are performed to evaluate the performance of the combined approach of RST and CBR. Finally, concluding remarks are drawn from this study in Section 5.

Rough Set Theory
RST was introduced by Pawlak (1982) to deal with inexact, uncertain or vague information. This theory, as a powerful mathematical tool, does not need any preliminary or additional information about the data under analysis. In this section, some basic definitions of rough sets are presented (Pawlak, 1982; Pawlak, 1990).
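Definition 1 (Decision table). A decision table is a tuple $S = (U, A, V, f)$, where $U$ is a finite nonempty set of objects, $A = C \cup D$ is a finite set of features, where $C$ represents a set of conditional features and $D$ is a decision feature, and $V = \bigcup_{a \in A} V_a$, where $V_a$ is the value set of $a$. Every feature $a \in A$ defines an information function $f_a : U \to V_a$.

Definition 2 (Indiscernibility relation). For every set of features $B \subseteq A$, the indiscernibility relation is

$$IND(B) = \{(x, y) \in U \times U : f_a(x) = f_a(y) \text{ for all } a \in B\}.$$

$IND(B)$ is an equivalence relation, and its equivalence classes are called elementary sets.

Definition 3 (Lower and upper approximation sets). Let $R$ be an equivalence relation on $U$ induced by a set of features, and let $X \subseteq U$ be arbitrary. The lower and upper approximations of $X$ with respect to $R$ are defined as follows:

$$\underline{R}(X) = \bigcup \{Y \in U/R : Y \subseteq X\}, \qquad \overline{R}(X) = \bigcup \{Y \in U/R : Y \cap X \neq \emptyset\}.$$

The lower approximation of $X$, $\underline{R}(X)$, is the union of all elementary sets that certainly belong to $X$, and the upper approximation of $X$, $\overline{R}(X)$, is the union of all elementary sets that possibly belong to $X$.

Definition 4 (Positive, negative and boundary region). After defining these approximations of $X$, the reference universe $U$ is divided into three different regions: the $R$-positive region of $X$, $POS_R(X)$, the $R$-boundary region of $X$, $BND_R(X)$, and the $R$-negative region of $X$, $NEG_R(X)$. They are given by the following formulae:

$$POS_R(X) = \underline{R}(X), \qquad BND_R(X) = \overline{R}(X) - \underline{R}(X), \qquad NEG_R(X) = U - \overline{R}(X).$$

In a decision table,

$$POS_R(D) = \bigcup_{X \in U/D} \underline{R}(X)$$

is the positive region of decision $D$ on $R$, where $D$ is the decision feature and $U/D$ is the set of the equivalence classes generated by $D$.

Definition 5 (Degree of dependency). Given the decision table, the degree of dependency of $D$ on $R$ can be defined as

$$\gamma_R(D) = \frac{|POS_R(D)|}{|U|},$$

where $|\cdot|$ denotes the cardinality of a set.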
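Definitions 2 to 5 can be made concrete with a small sketch. The code below is illustrative only: the toy decision table, the feature names and the helper names (`partition`, `lower_upper`, `dependency`) are invented for this example and do not come from the paper's implementation.

```python
from collections import defaultdict

def partition(rows, features):
    """Group object indices into the equivalence classes of IND(features)."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        classes[tuple(row[f] for f in features)].add(idx)
    return list(classes.values())

def lower_upper(rows, features, target):
    """Return the (lower, upper) approximations of the index set `target`."""
    lower, upper = set(), set()
    for block in partition(rows, features):
        if block <= target:      # elementary set certainly inside X
            lower |= block
        if block & target:       # elementary set possibly inside X
            upper |= block
    return lower, upper

def dependency(rows, features, decision):
    """Degree of dependency: |POS_R(D)| / |U|."""
    positive = set()
    for decision_class in partition(rows, [decision]):
        positive |= lower_upper(rows, features, decision_class)[0]
    return len(positive) / len(rows)

# Toy decision table, invented for illustration.
table = [
    {"fever": "yes", "cough": "yes", "flu": "yes"},
    {"fever": "yes", "cough": "no",  "flu": "yes"},
    {"fever": "no",  "cough": "yes", "flu": "no"},
    {"fever": "no",  "cough": "yes", "flu": "yes"},
    {"fever": "no",  "cough": "no",  "flu": "no"},
]
flu_yes = {i for i, row in enumerate(table) if row["flu"] == "yes"}
lower, upper = lower_upper(table, ["fever", "cough"], flu_yes)
gamma = dependency(table, ["fever", "cough"], "flu")
print(lower, upper, gamma)
```

Objects 2 and 3 share the same conditional values but differ in the decision, so they fall into the boundary region and the table is inconsistent: the dependency degree is below 1.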
Definition 6 (Reduct and core). Reducts and cores are fundamental concepts of RST. A reduct is a set of essential features which can discern all objects. Formally, a reduct is a subset $R \subseteq C$ such that $\gamma_R(D) = \gamma_C(D)$ and no proper subset of $R$ has this property. In other words, a reduct is a minimal subset of features that preserves the degree of dependency of the decision feature on the full set of conditional features. The core is the set of features common to all reducts, that is, the intersection of all the relative reducts.

This paper uses the exact algorithm proposed by Rezvan et al. (2014). This algorithm examines the solution tree of the feature selection problem with a breadth-first strategy and stores the pruned nodes of this tree in a trie data structure to avoid redundant calculations. The algorithm is able to detect all optimal solutions.
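To illustrate Definition 6, the following sketch enumerates feature subsets level by level (by subset size) and returns all smallest subsets that preserve the positive region. It is a plain brute-force search, not the trie-based exact algorithm of Rezvan et al. (2014); the toy table and all function names are assumptions made for this example.

```python
from itertools import combinations

def positive_region_size(rows, features, decision):
    """Count objects whose equivalence class under `features` has a single decision value."""
    decisions, counts = {}, {}
    for row in rows:
        key = tuple(row[f] for f in features)
        decisions.setdefault(key, set()).add(row[decision])
        counts[key] = counts.get(key, 0) + 1
    return sum(n for key, n in counts.items() if len(decisions[key]) == 1)

def minimal_reducts(rows, conditional, decision):
    """Return all smallest feature subsets that preserve the full dependency degree."""
    full = positive_region_size(rows, conditional, decision)
    for size in range(len(conditional) + 1):   # breadth-first: one level per subset size
        found = [set(subset) for subset in combinations(conditional, size)
                 if positive_region_size(rows, subset, decision) == full]
        if found:
            return found
    return []

def core(rows, conditional, decision):
    """Features whose removal lowers the dependency degree."""
    full = positive_region_size(rows, conditional, decision)
    return {f for f in conditional
            if positive_region_size(rows, [g for g in conditional if g != f], decision) < full}

# Toy table, invented for illustration: feature "c" duplicates feature "a".
table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
reducts = minimal_reducts(table, ["a", "b", "c"], "d")
print(reducts, core(table, ["a", "b", "c"], "d"))
```

Because "a" and "c" carry the same information, either can be paired with "b" to form a minimal reduct, and only "b" belongs to the core. The search is exponential in the number of features, which is why pruning strategies such as the one cited above matter for larger datasets.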

Case-based Reasoning
The key parameters of CBR, including feature selection, feature weighting, the similarity measure and the value of k in the k-nearest neighbor method, are of great importance, since each of them can play a significant role in the efficiency of classification. Feature selection is performed using RST, as described in the previous section, independently of the CBR system; the other parameters of CBR systems are discussed below.

The similarity measure is the key measure of the CBR system for reliably classifying new samples. This measure can be considered as the sum of the distances of features between two cases. If the similarity measure is unable to separate cases adequately, then the CBR system will perform poorly. Thus, selecting an appropriate similarity measure is necessary for CBR systems. If the features of a case have numerical values, the distance between two cases can be calculated using the Manhattan distance, the Euclidean distance, etc.; computing the distance between two cases with categorical values is more difficult, particularly when a feature has more than two categorical values. The Hamming distance is a well-known distance function for categorical features that is given as

$$d_c(a, b) = \begin{cases} 0 & \text{if } a_c = b_c, \\ 1 & \text{otherwise,} \end{cases}$$

where $d_c(a, b)$ shows the distance between the two cases $a$ and $b$ in feature $c$. This function does not perform well in some cases.

The distance function can instead be defined following a maximum likelihood approach. In other words, this distance is based on the proportional distribution of the different categorical values of a feature. This distance was presented by Rezvan et al. (2013) as the distance function of a case-based reasoning system with mixed features. Suppose $m$ is the number of categorical features, each of which has $r_m$ values. The distance between two values $x$ and $y$ of a feature is computed with respect to the $m - 1$ other features. Let us assume that $z$ is one of the values of the $i$th feature; it must then be determined whether $z$ belongs to $A_j$ or to its complement $\bar{A}_j$. If [...] The distance between the two values $x$ and $y$ of a categorical feature is given as [...] Finally, the distance between the two cases $a$ and $b$ is defined as [...] In this distance function, $w_i$ is the weight assigned to the $i$th feature. Assigning appropriate weights to the features can improve the performance of the case-based reasoning system and also reduce its sensitivity to the distance function. The weight assigned to the $i$th feature is given as [...]


The k-nearest neighbor method searches for the k cases with maximum similarity and classifies the new case by majority decision. The number of cases in the dataset can affect the suitable value of k, which is determined by trial and error. Pseudocode for the proposed CBR model is given in the Appendix.
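The retrieval step described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the value distance uses the widely used co-occurrence form in which each value of another feature is assigned to the side where it is more probable; the feature weight is taken as the average distance over all pairs of values of that feature, as stated in the abstract; the case distance is assumed to be the weighted sum of per-feature distances; and a value never seen in the case base is assumed to be at maximum distance 1. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cond_probs(cases, i, j):
    """Estimate P(feature j = z | feature i = x) by counting."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[i]][case[j]] += 1
    return {x: {z: n / sum(c.values()) for z, n in c.items()}
            for x, c in counts.items()}

def value_distance(cases, i, x, y):
    """Co-occurrence distance between values x and y of feature i."""
    if x == y:
        return 0.0
    seen = {case[i] for case in cases}
    if x not in seen or y not in seen:
        return 1.0  # assumption: unseen values are maximally distant
    m = len(cases[0])
    total = 0.0
    for j in range(m):
        if j == i:
            continue
        p = cond_probs(cases, i, j)
        px, py = p[x], p[y]
        # Each value z of feature j joins the side where it is more probable.
        total += sum(max(px.get(z, 0.0), py.get(z, 0.0))
                     for z in set(px) | set(py)) - 1.0
    return total / (m - 1)

def feature_weight(cases, i):
    """Assumed weight: average distance over all pairs of values of feature i."""
    pairs = list(combinations(sorted({case[i] for case in cases}), 2))
    if not pairs:
        return 0.0
    return sum(value_distance(cases, i, x, y) for x, y in pairs) / len(pairs)

def classify(cases, labels, query, k):
    """Majority vote among the k cases closest to `query`."""
    m = len(query)
    weights = [feature_weight(cases, i) for i in range(m)]
    dist = [sum(weights[i] * value_distance(cases, i, case[i], query[i])
                for i in range(m)) for case in cases]
    nearest = sorted(range(len(cases)), key=dist.__getitem__)[:k]
    return Counter(labels[n] for n in nearest).most_common(1)[0][0]

# Toy case base, invented for illustration.
cases = [("red", "round"), ("red", "round"), ("green", "long"), ("green", "long")]
labels = ["apple", "apple", "bean", "bean"]
print(classify(cases, labels, ("red", "round"), k=3))
```

Unlike the Hamming distance, this distance is graded: two values of a feature are close when they co-occur with similar values of the other features, and a feature whose values are all mutually close receives a small weight.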

Experimental results
The combined approach, which uses rough set theory for feature selection and the developed case-based reasoning system as a classification model, is implemented in the C#.NET programming language. The framework of this approach is shown in Fig. 1.

Fig. 1. Framework of combined approach of RST and CBR
To evaluate this approach, three UCI datasets from the applied area of disease diagnosis are used (Hettich et al., 1998). Their specifications are presented in Table 1. The experimental results can be seen in Table 2. Based on these results, the accuracy improves to some extent as the number of nearest neighbors increases. The results also show that the combined approach performs no worse than the proposed CBR system without feature selection. For two of the datasets, "Lymphography" and "Primary Tumor", the combined approach performs better than the single CBR system. In more detail, the "Lymphography" dataset, containing 18 features and 148 instances, has one minimal reduct of length 6, which consists of features 2, 13, 14, 15, 16, and 18. The "Wisconsin Breast Cancer" dataset, containing 9 features and 699 instances, has 8 minimal reducts of length 4, one of which is the set {3, 5, 6, 7}. These two datasets are consistent, since their dependency degree is equal to 1, whereas the "Primary Tumor" dataset is inconsistent, with a dependency degree of 0.7109144. Based on the results shown in Table 2, the suitable number of nearest neighbors for the developed CBR system and the combined approach is determined to be 5 for the "Wisconsin Breast Cancer" and "Lymphography" datasets and 7 for "Primary Tumor".
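The evaluation protocol (5-fold cross-validation, repeated for several values of k, reporting the average and standard deviation of accuracy) can be sketched as below. This is a hedged illustration rather than the authors' C#.NET code: a plain Hamming-distance k-nearest neighbor classifier stands in for the developed CBR system, the folds are formed by a simple shuffled round-robin split, and the synthetic data and function names are invented for the example.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def hamming(a, b):
    """Number of features in which two categorical cases differ."""
    return sum(u != v for u, v in zip(a, b))

def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training cases closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda n: hamming(train_x[n], query))
    return Counter(train_y[n] for n in order[:k]).most_common(1)[0][0]

def cross_validate(x, y, k, folds=5, seed=0):
    """Return (mean, standard deviation) of accuracy over the folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]                      # every folds-th shuffled index
        held_out = set(test)
        train = [n for n in idx if n not in held_out]
        train_x = [x[n] for n in train]
        train_y = [y[n] for n in train]
        hits = sum(knn_predict(train_x, train_y, x[n], k) == y[n] for n in test)
        scores.append(hits / len(test))
    return mean(scores), pstdev(scores)

# Synthetic categorical data, invented for illustration:
# the first two features determine the class, the third is noise.
shapes = ["p", "q", "r", "s", "t"]
x = [(c, c, s) for c in ("x", "y") for s in shapes for _ in range(2)]
y = [case[0] for case in x]

for k in (1, 3, 5):
    avg, std = cross_validate(x, y, k)
    print(f"k={k}: accuracy {avg:.3f} +/- {std:.3f}")
```

In practice the same loop would be run once with all features and once with only the features of a reduct, so that the two columns of a table like Table 2 can be compared for each k.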

Table 2
The average and standard deviation of accuracy for the developed CBR system and the combined approach

Dataset
This approach is compared with classifiers developed by other authors in Tables 3, 4 and 5. These comparisons show that the performance of this approach is acceptable and that its accuracy is better than that of the other classifiers in most cases. It should be noted that classifiers such as neural networks and support vector machines have slightly higher accuracy in Table 3, but these models are less interpretable than the combined approach. The results in Tables 4 and 5 show that the combined approach of this paper is competitive with other classifiers presented in previous studies. Because neural networks and support vector machines act as black boxes, they do not give useful interpretations to the researcher. In contrast, the combined approach identifies the features that are effective in classification through the feature weights.

On the other hand, if the disease diagnosis datasets can be connected to a treatment dataset, other useful results can be derived. For instance, in the breast cancer dataset a new case is diagnosed by the CBR system as either benign or malignant, while the treatment dataset contains two classes labeled life or death; by connecting these datasets, one can select treatments which lead to life. In fact, there may be benign cases which result in death and malignant cases which survive.

The combined approach 39.70

Conclusion
Reducing the number of irrelevant, unnecessary or redundant features helps a CBR classifier to retrieve the most relevant information in large datasets. Noise reduction and computational efficiency are advantages of appropriate feature selection, which has an inescapable effect on classification accuracy. This paper presented a combined approach of RST and a CBR system in which RST is used for feature selection and the CBR system serves as the classifier. The CBR system uses a distance based on the proportional distribution of the different categorical values of features. By calculating this distance and determining the weights of the features, a new case is classified based on its nearest cases.
The experimental results on categorical UCI datasets from the disease area show that the performance of this approach is not worse than that of the single CBR system and that it obtains improved results on the three datasets. Further investigations on other datasets can demonstrate the efficiency of this approach. Finally, it is worthwhile to note that RST is sufficiently general to be applicable across a wide range of single classifiers.

Table 1
Specifications of datasets under study
Dataset | Number of instances | Number of features | Number of classes

Table 3
The proposed model in comparison with other classifiers on the "Wisconsin Breast Cancer" dataset

Table 4
The proposed model in comparison with other classifiers on the "Lymphography" dataset

Table 5
The proposed model in comparison with other classifiers on the "Primary Tumor" dataset