A Rule Evaluation Support Method with Learning Models Based on Objective Rule Evaluation Indexes

In this paper, we present an evaluation of learning algorithms for a novel rule evaluation support method that post-processes mined results with rule evaluation models based on objective indices. Post-processing of mined results is one of the key steps in the data mining process. However, it is difficult for human experts to completely evaluate several thousands of rules mined from a large, noisy dataset. To reduce the cost of such rule evaluation tasks, we have developed a rule evaluation support method with rule evaluation models that are learned from a dataset. This dataset comprises objective indices for mined classification rules and an evaluation of each rule by a human expert. To evaluate the performance of learning algorithms for constructing rule evaluation models, we conducted a case study on meningitis data mining as an actual problem. Furthermore, we also evaluated our method with ten rule sets obtained from ten UCI datasets. Based on these results, we show the availability of our rule evaluation support method for human experts.


INTRODUCTION
In recent years, enormous amounts of data have been stored in information systems in natural science, social science, and business domains. Owing to the development of information technology, people have become able to obtain valuable knowledge from such data. Data mining techniques combine different kinds of technologies, such as database technologies, statistical methods, and machine learning methods, to utilize data stored in database systems. In particular, if-then rules, which are produced by rule induction algorithms, are considered one of the most usable and readable outputs of data mining. However, for large datasets with hundreds of attributes including noise, the mining process often produces many thousands of rules, and it is difficult for human experts to find the valuable knowledge that is rarely included in such a large rule set. To support rule selection, many efforts use objective rule evaluation indices such as recall, precision, and other interestingness measures (Hilderman, 2001; Tan, 2002; Yao, 1999) (hereafter, we refer to these indices as "objective indices"). Moreover, it is difficult to estimate the subjective criterion of a human expert with a single objective index, because his/her criterion, such as "interestingness" or "importance," is influenced by the amount of his/her prior knowledge and changes over time.
Data Science Journal, Volume 6, Supplement, 10 May 2007

In this paper, we present an adaptive rule evaluation support method for human experts that uses rule evaluation models. This method predicts the experts' criteria based on objective indices by re-using the results of previous evaluations by human experts. Section 2 summarizes previous work; in Section 3, we describe the rule evaluation model construction method based on objective indices. We present a performance comparison of learning algorithms for constructing rule evaluation models in Section 4.

RELATED WORK
Many research efforts have been made to select valuable rules from large mined rule sets based on objective rule evaluation indices. Some of these works suggest indices for discovering interesting rules from among a large number of rules.
Focusing on interesting rule selection with objective indices, researchers have developed more than forty objective indices based on the number of instances, probability, statistical values, information quantity, the distance between rules or their attributes, and rule complexity (Hilderman, 2001; Tan, 2002; Yao, 1999). Most of these indices are used to remove meaningless rules rather than to discover rules of real interest to a human expert, because they cannot incorporate domain knowledge. In contrast, a dozen subjective indices estimate how well a rule fits a belief, a bias, or a rule template formulated beforehand by a human expert. Although these subjective indices are useful to some extent in discovering really interesting rules because of their built-in domain knowledge, they depend on the precondition that a human expert is able to formulate his/her interest clearly. Furthermore, although the interestingness indices have been verified on the domains for which they were suggested, nobody has validated their applicability to other domains or their characteristics in relation to the background of a given dataset. Ohsaki et al. (Ohsaki, 2004) investigated the relation between objective indices and real human interest, using actual data mining results and their evaluations by human experts. Their comparison shows that it is difficult to predict real human interest with a single objective index. Based on this result, we see the possibility that a logical combination of objective indices can predict the actual interest of human experts more exactly.
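To make the discussion of objective indices concrete, the following is a minimal, hypothetical sketch (in Python, not taken from any of the cited works) that computes a few widely used indices from a rule's contingency counts; the argument names n_ab, n_a, n_b, and n are our own.

```python
# Illustrative sketch: a few objective rule evaluation indices computed
# from the counts of a rule "if A then B" over a dataset of n examples.
# These formulas are standard definitions, not the paper's implementation.

def rule_indices(n_ab, n_a, n_b, n):
    """n_ab: examples matching both antecedent A and consequent B;
    n_a: examples matching A; n_b: examples matching B; n: total examples."""
    support = n_ab / n                  # P(A and B)
    precision = n_ab / n_a              # confidence, P(B | A)
    recall = n_ab / n_b                 # P(A | B)
    lift = precision / (n_b / n)        # confidence relative to P(B)
    return {"support": support, "precision": precision,
            "recall": recall, "lift": lift}

# Hypothetical rule: covers 30 of its 40 antecedent matches correctly,
# while class B holds 50 of the 200 examples overall.
print(rule_indices(30, 40, 50, 200))
```

Each index captures a different notion of rule quality, which is why no single one matches a human expert's criterion.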

ON OBJECTIVE INDICES
In practical data mining situations, costly rule evaluation procedures are repeatedly performed by a human expert. In these situations, the useful results of each evaluation, such as focused attributes, interesting combinations, and valuable facts, are not explicitly used by any rule selection system but are tacitly stored in the human expert. To address this problem, we suggest a method for constructing rule evaluation models based on objective rule evaluation indices, as a way to describe the criteria of a human expert explicitly by re-using his/her previous evaluations. Combining this method with a rule visualization interface, we have designed a rule evaluation support tool that can carry out more exact rule evaluation with explicit rule evaluation models.

Constructing a Rule Evaluation Model
We consider the process of modeling the rule evaluations of human experts as the process of clarifying the relationships between the human evaluations and the features of the inputted if-then rules. Based on this consideration, we decided that the rule evaluation model construction process can be implemented as a learning task. Figure 1 shows the rule evaluation model construction process based on the re-use of human evaluations and objective indices for each mined rule. In the training phase, the attributes of a meta-level training dataset are obtained as the values of objective indices, such as recall, precision, and other rule evaluation measures; the human evaluation of each rule is attached as the class of each instance. To obtain this dataset, a human expert has to evaluate the whole or a part of the input rules at least once. After obtaining the training dataset, a rule evaluation model is constructed by applying a learning algorithm.
In the prediction phase, a human expert receives predictions for new rules based on their objective index values. Because the rule evaluation models are used for prediction, we need to choose a learning algorithm with high accuracy, as in ordinary classification problems.
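The two phases can be sketched as follows. This is an illustrative toy, not the authors' implementation: the index values and labels are hypothetical, and a 1-nearest-neighbour learner stands in for the actual algorithms such as J4.8 or SVM.

```python
# Training phase: pair each rule's objective index values with the
# expert's label. Prediction phase: label a new rule from the model.

def train(index_vectors, expert_labels):
    """Training phase: memorise (index values, expert label) pairs."""
    return list(zip(index_vectors, expert_labels))

def predict(model, new_vector):
    """Prediction phase: return the label of the closest evaluated rule."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, new_vector))
    return min(model, key=lambda pair: dist(pair[0]))[1]

# Hypothetical meta-level data: (recall, precision) per rule,
# with expert labels I (interesting) / NI (not interesting).
rules = [(0.9, 0.8), (0.2, 0.3), (0.85, 0.9), (0.1, 0.4)]
labels = ["I", "NI", "I", "NI"]
model = train(rules, labels)
print(predict(model, (0.88, 0.85)))   # a new rule close to the "I" rules
```

The point of the sketch is the data flow: objective indices become attributes, expert labels become classes, and any classifier can then fill the prediction phase.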

A Tool to Support Rule Evaluation with Rule Evaluation Models
Our rule evaluation support tool provides interactive support while a human expert evaluates the rule sets produced by a mining procedure. When analyzing a rule set for a totally new task for the first time, a human expert sorts the rules based on some objective indices and then evaluates the whole or a part of them. On the other hand, if there are previous evaluation results by human experts for the same or a similar problem, predictions for the input rules can be displayed to the expert. To obtain these predictions, the tool uses the rule evaluation model construction procedure. The expert then corrects the displayed predictions during his/her evaluation, and with the corrected evaluations the system rebuilds the rule evaluation model. Through the above procedures, our tool provides rule evaluation support for a human expert as shown in Figure 2. A human expert can use this tool both as a passive support tool, with sorting functions based on objective indices, and as an active support tool, with predictions from rule evaluation models learned from a dataset based on objective indices.

PERFORMANCE COMPARISONS OF LEARNING ALGORITHMS FOR RULE MODEL CONSTRUCTION
To predict the human evaluation labels of a new rule based on objective indices more accurately, we have to construct a rule evaluation model with higher predictive accuracy. In this section, we first present the result of an empirical evaluation with the dataset obtained from the result of a meningitis data mining (Hatazawa, 2000). Then, to confirm the performance of our approach on other datasets, we evaluated the five algorithms on ten rule sets obtained from ten UCI benchmark datasets (Hettich, 1998). Based on the experimental results, we discuss the following: the accuracy of the rule evaluation models, the learning curves of the learning algorithms, and the contents of the learned rule evaluation models. For evaluating the accuracy of the rule evaluation models, we compared predictive accuracies on the entire dataset and with Leave-One-Out validation. The accuracy on a validation dataset D is calculated from the number of correctly predicted instances Correct(D) as:

Acc(D)=(Correct(D)/|D|)*100,

where |D| is the size of the dataset.
The recall of class i on a validation dataset is calculated from the number of correctly predicted instances of that class, Correct(D_i), as:

Recall(D_i)=(Correct(D_i)/|D_i|)*100.

Further, the precision of class i is calculated from the number of instances predicted as class i, Predicted(D_i), as:

Precision(D_i)=(Correct(D_i)/Predicted(D_i))*100.
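As a direct sketch of the three measures above, they can be computed from lists of true and predicted labels; the example labels below are hypothetical, and the factor of 100 follows the paper's formulas.

```python
# Accuracy over a dataset, and per-class Recall/Precision,
# each scaled by 100 as in the formulas above.

def accuracy(true, pred):
    correct = sum(t == p for t, p in zip(true, pred))
    return correct / len(true) * 100

def recall(true, pred, i):
    correct_i = sum(t == p == i for t, p in zip(true, pred))
    return correct_i / sum(t == i for t in true) * 100     # |D_i| in the denominator

def precision(true, pred, i):
    correct_i = sum(t == p == i for t, p in zip(true, pred))
    return correct_i / sum(p == i for p in pred) * 100     # Predicted(D_i) in the denominator

# Hypothetical expert labels vs. model predictions over five rules.
true = ["I", "NI", "NI", "I", "NU"]
pred = ["I", "NI", "I",  "I", "NI"]
print(accuracy(true, pred), recall(true, pred, "I"), precision(true, pred, "I"))
```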
With regard to the learning curves, we obtained curves of the accuracies of the learning algorithms on the entire training dataset in order to evaluate whether each learning algorithm can perform well in the early stage of the rule evaluation process. The accuracies on randomly sub-sampled training datasets are averaged over 10 trials for each percentage of the subset. By observing the elements of the rule evaluation models on the meningitis data mining result, we consider the characteristics of the objective indices used in these models. In order to construct a dataset for learning a rule evaluation model, the values of the 39 objective indices shown in Table 1 were calculated for each rule. Thus, each dataset for each rule set has the same number of instances as the rule set, and each instance has 40 attributes including the class. Table 1 lists the 39 indices with their sources; among them are Kloesgen's Interestingness (KI) (Kloesgen, 1996), Relative Risk (RR) (Ali, 1997), Brin's Interest (BI) and Brin's Conviction (BC) (Brin, 1997), Certainty Factor (CF), Jaccard Coefficient (Jaccard), Odds Ratio (OR), and Yule's Q (YuleQ) (Tan, 2002), and F-Measure (F-M) (Rijsbergen, 1979). We applied five learning algorithms to these datasets to compare their performances as rule evaluation model construction methods. We used the following learning algorithms from Weka (Witten, 2000): the C4.5 decision tree learner (Quinlan, 1993) called J4.8, a neural network learner with back propagation (BPNN) (Hinton, 1986), support vector machines (SVM) (Platt, 1999), classification via linear regressions (CLR), and OneR (Holte, 1993).

Constructing Rule Evaluation Models for an Actual Datamining Result
In this case study, we considered 244 rules mined from six datasets about six kinds of diagnostic problems, as shown in Table 2. In these datasets, the appearances of meningitis patients were used as attributes and the diagnosis of each patient as the class. Each rule set was mined with a proper rule induction algorithm composed by the constructive meta-learning system CAMLET (Hatazawa, 2000). For each rule, we assigned one of three evaluation labels (I: Interesting, NI: Not-Interesting, NU: Not-Understandable) according to evaluation comments provided by a medical expert.

Table 2. The six meningitis datasets and their mined rule sets (the headings of the first two numeric columns are reconstructed from context):

Dataset        #Attrs  #Classes  #Rules    I   NI   NU
Diag             29       6        53     15   38    0
C_Course         40      12        22      3   18    1
Culture+diag     31      12        57      7   48    2
Diag2            29       2        35      8   27    0
Course           40       2        53     12   38    3
Cult_find        31       2        24      3   18    3
TOTAL            --      --       244     48  187    9

Comparison of Classification Performances
In this section, we present the results of the accuracy comparison over the entire dataset, the recall of each class label, and the corresponding precision. Because Leave-One-Out repeatedly holds out just one instance as the test case, using the remaining instances as the training dataset, for each instance of a given dataset, we can evaluate the performance of a learning algorithm on new data without any ambiguity due to random sampling.
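The Leave-One-Out procedure described above can be sketched generically; the majority-class learner below is only a hypothetical stand-in for the five algorithms, and the toy data is ours.

```python
# Leave-One-Out: each instance in turn is the single test case,
# and all remaining instances form the training set.

def leave_one_out(data, labels, fit, predict_one):
    correct = 0
    for i in range(len(data)):
        train_x = data[:i] + data[i+1:]
        train_y = labels[:i] + labels[i+1:]
        model = fit(train_x, train_y)
        if predict_one(model, data[i]) == labels[i]:
            correct += 1
    return correct / len(data) * 100   # accuracy scaled as in the paper

# Hypothetical stand-in learner: always predict the majority class.
def fit(xs, ys):
    return max(set(ys), key=ys.count)

def predict_one(model, x):
    return model

print(leave_one_out([1, 2, 3, 4], ["NI", "NI", "NI", "I"], fit, predict_one))
```

Because every instance is tested exactly once, the result is deterministic, which is what makes Leave-One-Out unambiguous.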
The performances of the five learning algorithms on the entire training dataset and the results of Leave-One-Out are shown in Table 3. All of the accuracies, the recalls of I and NI, and the precisions of I and NI are higher than those obtained by simply predicting the default label.
Compared with the accuracy of OneR, the other learning algorithms achieve equal or higher performance by using combinations of multiple objective indices rather than sorting with a single objective index. With regard to the Recall values of class I, BPNN achieved the highest performance. The other algorithms exhibit lower performance than OneR because they tend to learn classification patterns for the majority class NI. The accuracy under Leave-One-Out demonstrates the robustness of each learning algorithm; it ranges from 75.8% to 81.9%. However, none of the learning algorithms was able to classify the instances of class NU, because it is difficult to predict such a minor class label in this dataset.

Learning Curves of the Learning Algorithms
Since the rule evaluation model construction method requires the mined rules to be evaluated by a human expert, we investigated the learning curve of each learning algorithm to estimate the minimum training subset needed to construct a valid rule evaluation model. The table in the upper portion of Figure 3 shows the accuracies on the entire training dataset for each size of training subset. The achievement ratio of each learning algorithm, i.e., its accuracy relative to its accuracy on the whole dataset, is shown in the lower section of Figure 3. As observed in these results, SVM and CLR, which learn hyper-planes, obtained an achievement ratio greater than 95% using less than 10% of the training subset. Although the decision tree learner and BPNN could determine better classifiers on the entire dataset than the hyper-plane learners, they need more training instances to determine accurate classifiers. In order to eliminate known, ordinary knowledge from a large rule set, the non-interesting rules need to be classified correctly. The upper right table in Figure 3 shows the Recall values on NI, and the lower right chart shows their achievement ratios relative to the Recall of NI on the entire training dataset. From this result, we can eliminate the NI rules with the rule evaluation models from SVM and BPNN even when only 10% of the rule evaluations have been performed by a human expert. This is guaranteed with no less than 80% precision for all learning algorithms.
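The sub-sampling procedure behind the learning curves (random training subsets at each percentage, averaged over 10 trials, evaluated on the entire dataset) can be sketched as follows; the majority-class learner and the toy labels are hypothetical stand-ins.

```python
import random

# Learning-curve sketch: for each subset percentage, draw `trials` random
# subsamples, fit a model on each, and average its accuracy on the whole
# dataset. A fixed seed keeps the sketch reproducible.

def learning_curve(data, labels, percentages, trials=10, seed=0):
    rng = random.Random(seed)
    curve = {}
    for pct in percentages:
        k = max(1, int(len(data) * pct / 100))
        accs = []
        for _ in range(trials):
            idx = rng.sample(range(len(data)), k)
            sub = [labels[i] for i in idx]
            model = max(set(sub), key=sub.count)    # majority-class stand-in
            accs.append(sum(model == y for y in labels) / len(labels) * 100)
        curve[pct] = sum(accs) / trials
    return curve

# Hypothetical rule set: 8 NI rules and 2 I rules.
labels = ["NI"] * 8 + ["I"] * 2
print(learning_curve(list(range(10)), labels, [10, 50, 100]))
```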

Rule Evaluation Models on the Actual Datamining Result Dataset
In this section, we present the rule evaluation models for the entire dataset learned with OneR, J4.8, and CLR, because these algorithms produce explicit models: a rule set, a decision tree, and a set of linear models, respectively. As shown in Figure 4 and Figure 5, the indices used in the learned rule evaluation models are taken not only from the group of indices that increase with the correctness of a rule but also from other groups. YLI1, Laplace Correction, Accuracy, Precision, Recall, Coverage, PSI, and Gini Gain are the indices from the former group used in the models. The latter indices are GBI and Peculiarity, which sum up the differences in antecedents between one rule and the other rules in the same rule set. This corresponds to the comments provided by the medical expert, who said that he evaluated these rules not only according to their correctness but also according to their interestingness based on his expertise.

Constructing Rule Evaluation Models on Artificial Evaluation Labels
We also evaluated our rule evaluation model construction method using rule sets obtained from ten datasets of the UCI machine learning repository, in order to confirm its lower-limit performance on probabilistic class distributions. We selected the following ten datasets: Anneal, Audiology, Autos, Balance-scale, Breast-cancer, Breast-w, Colic, Credit-a, Waveform, and Letter. From these datasets, we obtained rule sets with bagged PART, which repeatedly executes PART on bootstrapped training subsamples. For these rule sets, we calculated the 39 objective indices as the attributes of each rule. With regard to the classes of these datasets, we used three class distributions drawn from multinomial distributions. Table 4 shows the process flow for obtaining these datasets and describes them under the three different class distributions. The class distribution of "Distribution I" is P=(0.30, 0.35, 0.35), where p_i is the probability of class i; thus, the number of class i instances in each dataset D_j becomes p_i|D_j|. Similarly, the probability vector of "Distribution II" is P=(0.30, 0.50, 0.20) and that of "Distribution III" is P=(0.30, 0.65, 0.05).

Table 4. Flow diagram for obtaining the datasets, and the datasets of the rule sets learned from the UCI benchmark datasets.
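One way to realize such an artificial class distribution is to assign class counts in proportion to P and shuffle them over the rules; this is an illustrative sketch under that assumption, not necessarily the authors' sampling procedure.

```python
import random

# Assign artificial class labels to n_rules rules so that class i
# receives about p_i * n_rules instances, then shuffle their order.

def artificial_labels(n_rules, probs, seed=0):
    rng = random.Random(seed)
    counts = [round(p * n_rules) for p in probs[:-1]]
    counts.append(n_rules - sum(counts))    # last class takes the remainder
    labels = [c for c, k in enumerate(counts) for _ in range(k)]
    rng.shuffle(labels)
    return labels

# Distribution I from the text: P = (0.30, 0.35, 0.35).
labels = artificial_labels(100, (0.30, 0.35, 0.35))
print([labels.count(c) for c in range(3)])
```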

Accuracy Comparison on Classification Performances
For the above-mentioned datasets, we used the five learning algorithms to estimate whether their classification results reach or exceed the accuracy of simply predicting each default class. The left table of Table 5 shows the accuracies of the five learning algorithms for each class distribution of the datasets. As shown in Table 5, the learning algorithms achieve accuracies equal to or higher than the percentages of the default classes.

Evaluation of Learning Curves
As with the evaluation of the learning curves on the meningitis rule set, we estimated the minimum training subsets needed for a valid model, i.e., one that works better than simply predicting a default class. The right table in Table 5 shows the sizes of the minimum training subsets with which each learning algorithm constructs rule evaluation models that are more accurate than the percentage of the default class. On the smaller datasets, the learning algorithms were able to construct valid models with less than 25% of the given training datasets. However, for larger datasets such as Waveform and Letter, they need larger training subsets to construct valid models, because their performance on the entire training dataset falls toward the percentage of the default class of each dataset, as shown in the left table of Table 5.

CONCLUSION
In this paper, we have described the evaluation of five learning algorithms for a rule evaluation support method, which uses rule evaluation models to predict the evaluation of an if-then rule based on objective indices by re-using evaluations made by a human expert. In the performance comparison of the five learning algorithms, the rule evaluation models achieved higher accuracies than simply predicting each default class. Considering the difference between the actual evaluation labeling and the artificial evaluation labeling, it appears that the medical expert's evaluations took into account particular relations between an antecedent and a class, and between antecedents, in each rule. By estimating the robustness of these learning algorithms on new rules with Leave-One-Out, we obtained accuracies greater than 75.8%. By evaluating the learning curves, SVM and CLR were observed to achieve an achievement ratio greater than 95% using less than 10% of the training dataset, which includes certain human evaluations. These results indicate the availability of this rule evaluation support method for a human expert.
In the future, we will introduce a method for selecting a learning algorithm to construct a proper rule evaluation model according to each situation. We will also apply this rule evaluation support method to other data mining results, such as decision trees and rule sets, combining them with objective indices that evaluate whole mining results.