Progressive re-sampling for better rule set discovery

Rough set theory-based knowledge discovery systems have the desirable property of discovering rule sets solely from data, so the discovered rule sets are highly objective. Sometimes, however, this property may hinder the discovery of better knowledge models, depending on the composition of the target data sets, especially when the data sets have key-like attributes or attributes with poor dependency on the class attributes. Key-like attributes have minutely detailed values, and the correspondence between such conditional attributes and the class attributes is exact. This paper shows experimentally how this problem can be overcome. Two data sets are used for the experiments: one that has a key-like attribute and an attribute with poor class dependency, and another that has neither property, which serves as a counter case. The experiments gave very good results, and we identified precautions to take when applying rough set-based knowledge discovery systems. In addition, the experiments confirmed that adding more data to the target data set may increase the accuracy of rough set-based knowledge models.


Introduction
Rough set theory was invented to characterize imperfect knowledge, and is considered an alternative or a supplement to fuzzy set theory [1]. A strength of rough set theory is that it needs no prior assumptions about the knowledge domain, whereas fuzzy set theory requires possibility values that a human expert must assign. If we have good knowledge of the target domain, assigning such possibilities or grades of membership may not be difficult; but if we know very little about the target domain, the task can be very difficult, and many target domains for knowledge discovery are known to have this property [2]. In this sense, rough set theory is a very good tool for such tasks, because knowledge discovery based on rough set theory relies solely on data. To compensate for the subjectivity of human-assigned membership grades, there have been efforts to apply rough set theory-based techniques. For example, Chakraborty applied rough sets to refine membership functions [3], and Meng et al. combined fuzzy sets, soft sets, and rough sets to generate so-called soft fuzzy rough sets [4]. On the other hand, because rough sets can judge dependencies between attributes objectively, some researchers have used rough set-based methods as a pre-processing step to determine a good subset of attributes from raw data. For example, Coaquira and Acuña applied rough set-based dependency analysis between attributes and selected a subset of them [5]. Huijian and Feipeng applied principal component analysis combined with a rough set-based method to select a subset of attributes from high-dimensional business data before supplying the data to their SVM [6]. Others apply a different technique as a pre-processing step before searching for knowledge models with rough set-based methods. For example, Mahapatra et al. applied correlation analysis between attributes, before inducing rough set-based rules, to check whether dependencies between attributes exist [7]. These works show that rough set theory can be a good tool for determining dependencies between attributes and for finding knowledge models.
However, even though finding knowledge models from the data alone is one of the strengths of rough set-based knowledge discovery systems, this property may sometimes hinder the discovery of better knowledge models, depending on the composition of the given data set, especially when the data have key-like attributes or conditional attributes with poor dependency on the class attributes. Key-like attributes have minutely detailed values, and the correspondence between such conditional attributes and the class attributes is exact. Section 2 discusses a method to overcome the problem and the results of the experiments, and Section 3 presents conclusions.

The method and experiment
Rough set theory was proposed by Pawlak [8] in the early 1980s, and it has attracted much attention because it rests on a very solid mathematical foundation, while most other artificial intelligence techniques rely on imperfect or merely satisfactory methods such as heuristics. For details of rough set theory and related technologies, refer to [9]. Because rough set-based rule discovery systems find rules solely from data, data instances that occur more frequently, along with their attributes, are likely to be treated as more important during rule discovery, so that more accurate rules are found for them.
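The central construction behind this solid foundation can be shown concretely. Below is a minimal sketch on an invented toy decision table: the condition attributes induce an indiscernibility partition, and a target set of objects is approximated from below (blocks certainly inside it) and from above (blocks possibly intersecting it).

```python
from collections import defaultdict

def approximations(table, attrs, target):
    """table: list of attribute dicts; attrs: condition attributes used to
    partition the objects; target: set of row indices to approximate."""
    # Indiscernibility blocks: objects identical on the chosen attributes.
    blocks = defaultdict(set)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(i)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:
            lower |= block   # block lies entirely inside the target set
        if block & target:
            upper |= block   # block overlaps the target set
    return lower, upper

# Toy table (attribute names and values are invented for illustration).
table = [
    {"age": 1, "edu": 1},  # object 0
    {"age": 1, "edu": 1},  # object 1 (indiscernible from object 0)
    {"age": 2, "edu": 1},  # object 2
]
lower, upper = approximations(table, ["age", "edu"], {0, 2})
```

Rough set-based rule inducers such as MODLEM build their certain and possible rules on these lower and upper approximations.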
So, in order to see progressively how often each attribute is referred to in the discovered rule set, re-sampling is performed and the rule discovery process is repeated until there is no further improvement in accuracy. Note that supplying more training instances generally tends to increase the accuracy of trained knowledge models. Two data sets from the UCI machine learning repository, Hayes-Roth and Iris, are used for the experiments [10]. The Hayes-Roth data set was selected because it has a key-like attribute, while the Iris data set does not. All experiments are based on 10-fold cross-validation. The rough set-based rule discovery system used is MODLEM, one of the successful implementations of rough set-based rule discovery [11,12].
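The procedure can be sketched as follows, assuming a toy nominal data set and a trivial exact-match learner standing in for MODLEM (which has no widely available Python implementation); the re-sampling loop, not the learner, is the point.

```python
import random

def train(rows):
    # Stand-in "rule set": map each seen condition tuple to its majority class.
    table = {}
    for cond, cls in rows:
        table.setdefault(cond, []).append(cls)
    return {c: max(set(v), key=v.count) for c, v in table.items()}

def cv_accuracy(rows, folds=10):
    # Plain 10-fold split by index stride; accuracy over all held-out rows.
    correct = 0
    for f in range(folds):
        test = rows[f::folds]
        rules = train([r for i, r in enumerate(rows) if i % folds != f])
        correct += sum(1 for cond, cls in test if rules.get(cond) == cls)
    return correct / len(rows)

def progressive_resample(rows, max_rate=8, seed=0):
    # Try over-sampling rates 100%, 200%, ..., max_rate * 100%, and stop
    # as soon as 10-fold CV accuracy stops improving.
    rng = random.Random(seed)
    best_rate, best_acc = 1, cv_accuracy(rows)
    for rate in range(2, max_rate + 1):
        sample = [rng.choice(rows) for _ in range(rate * len(rows))]
        acc = cv_accuracy(sample)
        if acc <= best_acc:
            break
        best_rate, best_acc = rate, acc
    return best_rate, best_acc

# Toy data: ((a, b) condition tuple, class); every condition is unique,
# mimicking a key-like attribute, so plain 10-fold CV accuracy is 0.
data = [((a, b), (a + b) % 3) for a in range(4) for b in range(4)]
rate, acc = progressive_resample(data)
```

Note that with over-sampling by replacement, duplicated instances can fall into both the training and test folds of the cross-validation, which is one mechanism by which higher re-sampling rates raise the measured accuracy of a memorizing learner.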

Hayes-Roth data set
The Hayes-Roth data set contains exemplars that were used to build property-set models of classification for concept learning [13]. The data set has 5 conditional attributes: name, hobby, age, educational level, and marital status. All conditional attributes except name take a few nominal values. The name attribute in the original data set was transformed from numerical to nominal values, because a name usually has no numerical meaning. There are 3 classes, and the total number of instances is 132.
Three rules with an accuracy of 0% in 10-fold cross-validation were found from the original data set. Note that because the experiment uses 10-fold cross-validation, even though the training stage found 100% certain rules, as can be seen in the rule descriptions, testing gave very poor results. This rule set reveals a property of rough set-based rule generation very well: rule generation is frank and based solely on the data, but the rules are over-fitted, since name is unique for each data instance, and therefore they generalize very poorly.
Three rules with an accuracy of 83.333% were found at a re-sampling rate of 200%. Compared with the rules from the original data, seven names are dropped from rule 1 (5, 7, 9, 35, 45, 67, and 105), ten names are dropped from rule 2 (29, 37, 49, 53, 56, 65, 77, 80, 86, and 115), and two names are dropped from rule 3 (11 and 61). Even though the accuracy of the rule set improved a lot, 44 of the 264 instances remain unclassified by this rule set. Table 1 shows the corresponding confusion matrix. As the re-sampling rate was increased, the accuracy of the found rule set increased, with slight changes in the condition parts of the rules; that is, a few more names were added as the re-sampling rate increased. When the re-sampling rate reached 800%, the accuracy became 100%.
From the experiments with various re-sampling rates, we found that a sufficient number of instances may improve the accuracy of rough set-based knowledge models for this data set. But because the attribute 'name' has a key-like property that hinders the generalization of rules, we ran the algorithm again to see the effect of eliminating the 'name' attribute.
Thirteen rules with an accuracy of 78.7879% were found for the original data set after dropping the name attribute. Table 2 shows the corresponding confusion matrix. A re-sampling rate of 200% generated a rule set of 16 rules with an accuracy of 87.5%, and a rate of 300% generated 18 rules with an accuracy of 87.3737%. Table 3 shows the total number of instances covered by each attribute that appears in each rule. For example, 11 instances each are covered by the attributes 'age', 'edu', and 'marital' in rule 1. Based on the summary in Table 3, the attribute hobby may be dropped because it contributes least to rule formation. After dropping the name and hobby attributes, we found 11 rules with an accuracy of 84.0909%. Note that this accuracy is comparable to the best accuracy found so far. Table 4 shows the corresponding confusion matrix, and there are no unclassified instances. The experiment shows that dropping the attribute hobby lets the other attributes cover more data instances, and the rough set rule discovery system no longer attends to minute details caused by hobby, resulting in better accuracy. A similar explanation applies to the attribute name.
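A Table 3-style attribute contribution summary can be computed as sketched below; the rule representation and the toy values are hypothetical (each rule maps attribute names to required nominal values), but the counting scheme follows the description above: every rule's coverage is credited to each attribute appearing in its condition part.

```python
from collections import Counter

def attribute_coverage(rules, instances):
    """For each attribute, total the number of instances covered by the
    rules whose condition part uses that attribute."""
    cover = Counter()
    for conditions, _cls in rules:
        matched = sum(1 for inst in instances
                      if all(inst.get(a) == v for a, v in conditions.items()))
        for attr in conditions:
            cover[attr] += matched
    return cover

# Toy Hayes-Roth-style data (attribute names follow the paper; values invented).
instances = [
    {"hobby": 1, "age": 2, "edu": 1, "marital": 1},
    {"hobby": 2, "age": 2, "edu": 1, "marital": 2},
    {"hobby": 1, "age": 1, "edu": 3, "marital": 1},
]
rules = [
    ({"age": 2, "edu": 1}, 1),  # covers the first two instances
    ({"hobby": 2}, 2),          # covers only the second instance
]
cov = attribute_coverage(rules, instances)
least = min(cov, key=cov.get)   # candidate attribute to drop
```

In this toy run, hobby has the least coverage and would be the candidate for removal, mirroring the decision made from Table 3.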

Iris data set
The Iris data set contains 150 instances of 3 classes of iris plant: Iris setosa, Iris versicolor, and Iris virginica. There are 4 attributes: sL (sepal length in cm), sW (sepal width in cm), pL (petal length in cm), and pW (petal width in cm). The data set poses a well-known pattern recognition problem, and much work has been done to find better classification models for it since it became public in 1988 [15].
Twelve rules with an accuracy of 96% were found for the original data set. This accuracy is comparable to some of the best data mining models of other researchers, for example, an ensemble of Bayesian models. Table 5 shows the corresponding confusion matrix. As the re-sampling rate was increased, the accuracy of the found rule set increased, with slight changes in the composition of the rules; Table 6 shows the summary. The attributes rank as pL, pW, sL, and sW, but there is not much difference in the importance of the attributes with respect to instance coverage in the found rules. To see the effect of dropping attributes from the original data set, we dropped the sW attribute, which ranks last. After dropping sW, we found 11 rules with an accuracy of 94.6667%, slightly lower than the result from the original data set; still, this result is comparable to another researcher's previous best accuracy [16]. All in all, because the rough set rules from the original data remain the best while the re-sampled data showed better results, we can conclude from these two results, dropping an attribute versus over-sampling by re-sampling, that the accuracy of the knowledge model for the Iris data set may be improved by having more data, not by dropping attributes.
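The attribute-dropping comparison on Iris can be reproduced in outline as follows, assuming scikit-learn is installed; a decision tree stands in for MODLEM, so the accuracies will differ from the figures above, but the protocol (10-fold cross-validation, with and without sW) is the same.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold CV accuracy on all four attributes (order: sL, sW, pL, pW).
acc_full = cross_val_score(clf, X, y, cv=10).mean()

# Same protocol with sW (column 1) dropped.
acc_no_sw = cross_val_score(clf, np.delete(X, 1, axis=1), y, cv=10).mean()
```

Both runs use the same learner and fold scheme, so any difference between `acc_full` and `acc_no_sw` isolates the effect of removing the sW attribute.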

Conclusions
Rough set theory has the very good property that related knowledge discovery tools can analyse data based solely on the data themselves, and much research has been done and reported success. However, even though finding knowledge models from the data alone is one of the strengths of rough set-based knowledge discovery systems, this property may sometimes hinder the discovery of better knowledge models, depending on the composition of the given data sets, especially when the data have key-like attributes or attributes with only a small degree of dependency on the class attributes. Key-like attributes have minutely detailed values, and the correspondence between such conditional attributes and the class attributes is exact. To see how the problem can be overcome, a data set with these properties was selected and experimented on, and another data set without them was used as a counter case. The experiments with the two data sets gave very good results and also revealed precautions to take when applying rough set-based knowledge discovery systems. Moreover, the experiments confirm that supplying more target data may increase the accuracy of rough set-based knowledge models.