Reduced Data Sets and Entropy-Based Discretization

Results of experiments on numerical data sets discretized using two methods, global versions of Equal Frequency per Interval and Equal Interval Width, are presented. Globalization of both methods is based on entropy. For the discretized data sets, left and right reducts were computed. For each discretized data set, and for the two data sets based, respectively, on the left and right reducts, we applied ten-fold cross validation using the C4.5 decision tree generation system. Our main objective was to compare the quality of all three types of data sets in terms of the error rate. Additionally, we compared the complexity of the generated decision trees. We show that reduction of data sets may only increase the error rate and that decision trees generated from reduced data sets are not simpler than decision trees generated from non-reduced data sets.


A related problem, namely how reduction of the attribute set, occurring as a side-effect of discretization of numerical attributes, changes the error rate, was discussed in Reference [17].
For symbolic attributes, it was shown [22] that the quality of rule sets induced from reduced data sets, measured by the error rate evaluated by ten-fold cross validation, is worse than the quality of rule sets induced from the original data sets with no reduction of the attribute set.

Reducts
The set of all cases of the data set is denoted by U. An example of a data set with numerical attributes is presented in Table 1. For simplicity, all attributes have repetitive values, though in real life numerical attribute values are seldom repetitive. In our example, U = {1, 2, 3, 4, 5, 6, 7, 8}. The set of all attributes is denoted by A. In our example, A = {Length, Height, Width, Weight}. One of the variables is called a decision; in Table 1 it is Quality. Let B be a subset of the set A of all attributes. The indiscernibility relation IND(B) [23,24] is defined as follows: (x, y) ∈ IND(B) if and only if a(x) = a(y) for every a ∈ B, where x, y ∈ U and a(x) denotes the value of an attribute a ∈ A for a case x ∈ U. The relation IND(B) is an equivalence relation. The family of all equivalence classes of IND(B) is a partition on U, denoted by B*. For two partitions π and τ on U, π ≤ τ if and only if every block of π is a subset of some block of τ. A subset B of A satisfying B* ≤ {d}*, where d is the decision, and minimal with respect to this property is called a reduct. The idea of the reduct is important since we may restrict our attention to a subset B and construct a decision tree with the same ability to distinguish all concepts that are distinguishable in the data set with the entire set A of attributes. Note that any algorithm for finding all reducts is of exponential time complexity. In practical applications, we have to use some heuristic approach. In this paper, we suggest two such heuristic approaches, left and right reducts.
A left reduct is defined by a sequence of attempts to remove one attribute at a time, from right to left, checking after every attempt whether B* ≤ {d}*, where B is the current set of attributes. If this condition is true, we remove the attribute; if not, we put it back. For the example presented in Table 1, we start with an attempt to remove the rightmost attribute, that is, Weight. Similarly, a right reduct is defined by an analogous sequence of attempts to remove one attribute at a time, this time from left to right. Again, after every attempt we check whether B* ≤ {d}*. It is not difficult to see that the right reduct is the set {Width, Weight}.
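The left and right reduct computation described above is easy to sketch in code. The following Python fragment is only a minimal illustration, not the authors' implementation; the function names (partition, finer_or_equal, directional_reduct) and the data representation (a list of attribute-value rows plus a parallel list of decisions) are our own assumptions.

```python
def partition(rows, attrs):
    """Group case indices into blocks of the indiscernibility relation IND(attrs)."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, set()).add(i)
    return list(blocks.values())

def finer_or_equal(pi, tau):
    """pi <= tau: every block of partition pi is a subset of some block of tau."""
    return all(any(b <= c for c in tau) for b in pi)

def directional_reduct(rows, decisions, n_attrs, right_to_left=True):
    """Attempt to drop one attribute at a time; keep it only if B* <= {d}* would be violated."""
    d_star = partition([[d] for d in decisions], [0])
    kept = list(range(n_attrs))
    order = reversed(range(n_attrs)) if right_to_left else range(n_attrs)
    for a in order:
        trial = [x for x in kept if x != a]
        if trial and finer_or_equal(partition(rows, trial), d_star):
            kept = trial  # the condition still holds, so the attribute is redundant
    return kept

# Removal attempts from right to left yield a left reduct,
# removal attempts from left to right yield a right reduct:
# left  = directional_reduct(rows, quality, 4, right_to_left=True)
# right = directional_reduct(rows, quality, 4, right_to_left=False)
```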
For a discretized data set we may compute the left and right reducts and create three data sets: the discretized (non-reduced) data set and two data sets with the attribute set restricted to the left and right reducts, respectively. Then, for all three data sets, the error rate is evaluated by the C4.5 decision tree generation system using ten-fold cross validation. Our results show again that reduction of data sets causes an increase in the error rate.

Discretization
For a numerical attribute a, let $a_i$ be the smallest value of a and let $a_j$ be the largest value of a. In discretizing a we are looking for numbers $a_{i_0}, a_{i_1}, \ldots, a_{i_k}$, called cutpoints, where $a_{i_0} = a_i$, $a_{i_k} = a_j$, $a_{i_l} < a_{i_{l+1}}$ for $l = 0, 1, \ldots, k - 1$, and k is a positive integer. As a result of discretization, the domain $[a_i, a_j]$ of the attribute a is divided into k intervals
$$[a_{i_0}, a_{i_1}), [a_{i_1}, a_{i_2}), \ldots, [a_{i_{k-1}}, a_{i_k}].$$
In this paper such intervals are denoted by their endpoints, e.g., $a_{i_0}..a_{i_1}$. Discretization is usually conducted not on a single numerical attribute but on many numerical attributes. Discretization methods may be categorized as supervised (decision-driven, i.e., taking concepts into account) or unsupervised. Discretization methods processing all attributes are called global or dynamic; discretization methods processing a single attribute at a time are called local or static.
Let v be a variable and let $v_1, v_2, \ldots, v_n$ be the values of v, where n is a positive integer. Let S be a subset of U and let $p(v_i)$ be the probability of $v_i$ in S, where $i = 1, 2, \ldots, n$. The entropy $H_S(v)$ is defined as follows:
$$H_S(v) = -\sum_{i=1}^{n} p(v_i) \cdot \log p(v_i).$$
All logarithms in this paper are binary. Let a be an attribute, let $a_1, a_2, \ldots, a_m$ be all values of a restricted to S, let d be a decision and let $d_1, d_2, \ldots, d_n$ be all values of d restricted to S. The conditional entropy $H_S(d \mid a)$ is defined as follows:
$$H_S(d \mid a) = -\sum_{j=1}^{m} p(a_j) \cdot \sum_{i=1}^{n} p(d_i \mid a_j) \cdot \log p(d_i \mid a_j),$$
where $p(d_i \mid a_j)$ is the conditional probability of the value $d_i$ of the decision d given $a_j$; $j \in \{1, 2, \ldots, m\}$ and $i \in \{1, 2, \ldots, n\}$.
Let S be a subset of U, let a be an attribute and let q be a cutpoint splitting the set S into two subsets $S_1$ and $S_2$. The corresponding conditional entropy of the cutpoint q is defined as follows:
$$\frac{|S_1|}{|S|} \cdot H_{S_1}(d \mid a) + \frac{|S_2|}{|S|} \cdot H_{S_2}(d \mid a),$$
where |X| denotes the cardinality of the set X. Usually, the cutpoint q for which this entropy is the smallest is considered to be the best cutpoint. We also need a criterion for halting discretization. Commonly, discretization is halted when we may distinguish the same cases in the discretized data set as in the original data set with numerical attributes. In this paper, discretization is halted when the level of consistency [26], defined as
$$L_c = \frac{\sum_{X \in \{d\}^*} |\underline{A}X|}{|U|},$$
where $\underline{A}X$ is the lower approximation of the concept X, i.e., the union of all blocks of the partition defined by the (discretized) attribute set A that are subsets of X, reaches the requested value (in our experiments, 1).
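The entropy calculations above are straightforward to prototype. The following sketch, under the assumption that `values` and `decisions` are parallel lists restricted to the subset S, computes the entropy, the conditional entropy and the best cutpoint; the helper names are ours and the code is an illustration rather than the software used in the experiments.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H_S(v) = -sum of p(v_i) * log2 p(v_i) over the values occurring in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(values, decisions):
    """H_S(d|a): decision entropy within each block of equal attribute values, weighted by block size."""
    n = len(values)
    blocks = {}
    for v, d in zip(values, decisions):
        blocks.setdefault(v, []).append(d)
    return sum(len(ds) / n * entropy(ds) for ds in blocks.values())

def best_cutpoint(values, decisions, candidates):
    """Choose the cutpoint q minimizing |S1|/|S| * H_S1(d|a) + |S2|/|S| * H_S2(d|a)."""
    n = len(values)
    def split_entropy(q):
        s1 = [(v, d) for v, d in zip(values, decisions) if v < q]
        s2 = [(v, d) for v, d in zip(values, decisions) if v >= q]
        total = 0.0
        for part in (s1, s2):
            if part:  # an empty half contributes no entropy
                vs, ds = zip(*part)
                total += len(part) / n * conditional_entropy(vs, ds)
        return total
    return min(candidates, key=split_entropy)
```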

Equal Frequency per Interval and Equal Interval Width
Both discretization methods, Equal Frequency per Interval and Equal Interval Width, are frequently used in discretization and both are known to be efficient [25]. In local versions of these methods, only a single numerical attribute is discretized at a time [31]. The user provides a parameter k, equal to the requested number of intervals. In the Equal Frequency per Interval method, the domain of a numerical attribute is divided into k intervals with approximately equal numbers of cases. In the Equal Interval Width method, the domain of a numerical attribute is divided into k intervals of approximately equal width.
In this paper we present a supervised and global version of both methods, based on entropy [26]. Using this idea, we start by discretizing all numerical attributes with k = 2. Then the level of consistency is computed for the data set with discretized attributes. If the level of consistency is sufficient, discretization ends. If not, we select the worst attribute for additional discretization. The measure of quality of a discretized attribute $a^d$, called the average block entropy, is defined as follows:
$$M(a^d) = \frac{1}{|\{a^d\}^*|} \sum_{B \in \{a^d\}^*} \frac{|B|}{|U|} \cdot H_B(d).$$
The discretized attribute with the largest value of $M(a^d)$ is the worst attribute. This attribute is further discretized into k + 1 intervals. The process continues recursively. The worst-case time complexity is $O(m \cdot \log m \cdot n^2)$, where m is the number of cases and n is the number of attributes.

This method is illustrated by applying the Equal Frequency per Interval method to the data set from Table 1. Table 2 presents the resulting discretized data set. It is not difficult to see that the level of consistency for Table 2 is 1. For the data set presented in Table 2, the left and right reducts are equal to each other and equal to $\{Height^d, Width^d, Weight^d\}$. Table 3 presents the data set from Table 1 discretized by the Global Equal Interval Width discretization method. Again, the level of consistency for Table 3 is equal to 1. Additionally, for the data set from Table 3, both reducts, left and right, are also equal to each other and equal to $\{Length^d, Width^d\}$.
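A simplified sketch of this global loop follows. It repeats the entropy helper from the earlier sketch so that the fragment is self-contained, approximates the stopping rule by requiring a level of consistency of 1 (no two indiscernible cases with different decisions), and encodes our reconstruction of the average block entropy; it illustrates the control flow only and is not the authors' implementation.

```python
from collections import Counter
from math import log2
import numpy as np

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def equal_frequency_cutpoints(column, k):
    """k intervals with approximately equal numbers of cases (quantile boundaries)."""
    return np.quantile(column, [i / k for i in range(1, k)])

def equal_width_cutpoints(column, k):
    """k intervals of approximately equal width."""
    return np.linspace(column.min(), column.max(), k + 1)[1:-1]

def average_block_entropy(codes, decisions):
    """Our reconstruction of M(a^d): average over the blocks of {a^d}* of (|B|/|U|) * H_B(d)."""
    blocks = {}
    for c, d in zip(codes, decisions):
        blocks.setdefault(c, []).append(d)
    n = len(codes)
    return sum(len(ds) / n * entropy(ds) for ds in blocks.values()) / len(blocks)

def consistent(table, decisions):
    """Level of consistency 1: cases with identical discretized rows share the same decision."""
    seen = {}
    return all(seen.setdefault(tuple(row), d) == d for row, d in zip(table, decisions))

def global_discretization(data, decisions, cutpoint_method, max_k=20):
    data = np.asarray(data, dtype=float)
    ks = [2] * data.shape[1]                  # start with k = 2 intervals per attribute
    while True:
        table = np.column_stack([
            np.searchsorted(cutpoint_method(data[:, j], ks[j]), data[:, j], side="right")
            for j in range(data.shape[1])])   # interval index of every case, per attribute
        if consistent(table, decisions) or max(ks) >= max_k:
            return table, ks
        worst = max(range(data.shape[1]),     # worst attribute: largest average block entropy
                    key=lambda j: average_block_entropy(table[:, j], decisions))
        ks[worst] += 1                        # refine the worst attribute and repeat
```

Passing equal_frequency_cutpoints or equal_width_cutpoints as the cutpoint_method selects the Global Equal Frequency per Interval or Global Equal Interval Width variant, respectively.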

Experiments
We conducted experiments on 13 numerical data sets, presented in Table 4. All of these data sets, except bankruptcy, may be accessed in the Machine Learning Repository at the University of California, Irvine. The bankruptcy data set was described in Reference [37].
The main objective of our research is to compare the quality of decision trees generated by C4.5 directly from discretized data sets and from data sets based on reducts, in terms of the error rate and tree complexity. Data sets were discretized by the Global Equal Frequency per Interval and Global Equal Interval Width methods with the level of consistency equal to 1. For each numerical data set, three data sets were considered:
• an original (non-reduced) discretized data set,
• a data set based on the left reduct of the original discretized data set and
• a data set based on the right reduct of the original discretized data set.
The discretized data sets were input to the C4.5 decision tree generation system [21]. In our experiments, the error rate was computed using the internal ten-fold cross validation mechanism of C4.5.
Additionally, the internal discretization mechanism of C4.5 was not used in the experiments with left and right reducts, since in this case the data sets had already been discretized by the global discretization methods, so C4.5 treated all attributes as symbolic.
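For readers who want to reproduce the protocol approximately, the following sketch uses scikit-learn's DecisionTreeClassifier as a stand-in for C4.5 (the paper uses C4.5 itself) and, for brevity, treats simple column subsets of the iris data as placeholders for the left and right reducts rather than computing true reducts.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
variants = {
    "non-reduced": X,
    "left-style subset": X[:, :2],    # placeholder for the attributes of a left reduct
    "right-style subset": X[:, 2:],   # placeholder for the attributes of a right reduct
}
for name, data in variants.items():
    # ten-fold cross validation; the error rate is 1 minus the mean accuracy
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), data, y, cv=10)
    print(f"{name}: ten-fold CV error rate = {1 - scores.mean():.3f}")
```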
We illustrate our results with Figures 1 and 2. Figure 1 presents discretization intervals for the yeast data set, where discretization was conducted by the internal discretization mechanism of C4.5. Figure 2 presents discretization intervals for the same data set with discretization conducted by the global version of the Equal Frequency per Interval method (right reducts and left reducts were identical).

Results of our experiments are presented in Tables 5-8. These results were analyzed by the Friedman rank sum test with multiple comparisons, at the 5% level of significance. For data sets discretized by the Global Equal Frequency per Interval method, the Friedman test shows that there are significant differences between the three types of data sets: the non-reduced discretized data sets and the data sets based on left and right reducts. In most cases, the original, non-reduced data sets have smaller error rates than the data sets based on left and right reducts. However, the test of multiple comparisons shows that these differences are not statistically significant.
For data sets discretized by the Global Equal Interval Width method, the results are more conclusive. There are statistically significant differences between non-reduced discretized data sets and data sets based on left and right reducts. Moreover, the error rate for the non-reduced discretized data sets is significantly smaller than for both types of data sets based on left and right reducts. As expected, the difference between left and right reducts is not significant.
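The omnibus test itself is simple to run. The sketch below uses scipy.stats.friedmanchisquare with purely illustrative error rates (not the values from Tables 5-8); the post-hoc multiple comparison procedure would require an additional package and is omitted here.

```python
from scipy.stats import friedmanchisquare

# Illustrative per-data-set error rates for the three variants (not the paper's results).
non_reduced  = [0.21, 0.15, 0.03, 0.35]
left_reduct  = [0.23, 0.18, 0.03, 0.37]
right_reduct = [0.22, 0.17, 0.04, 0.36]

# Friedman rank sum test across the three related samples.
stat, p_value = friedmanchisquare(non_reduced, left_reduct, right_reduct)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.3f}")  # significant if p < 0.05
```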
For both discretization methods and all types of data sets (non-reduced, based on left and right reducts), the difference in complexity of the generated decision trees, measured by tree size and depth, is not significant. Table 8 shows the sizes of the left and right reducts created from data sets discretized by the Global versions of the Equal Frequency per Interval and Equal Interval Width methods. For some data sets, for example bupa, both the left and right reducts are identical with the original attribute set.

Conclusions
Our preliminary results [22] show that data reduction combined with rule induction causes an increase in the error rate. The results presented in this paper confirm this conclusion: reduction of data sets, combined with the C4.5 tree generation system, causes the same effect. Decision trees generated from reduced data sets have a larger error rate, as evaluated by ten-fold cross validation. Additionally, decision trees generated from reduced data sets are not simpler, in terms of tree size or tree depth, than decision trees generated from non-reduced data sets. Therefore, reduction of data sets (or feature selection) should be used with caution, since it may degrade the results of data mining.
In the future, we are planning to extend our experiments to large data sets and to include classifiers other than systems for rule induction and decision tree generation.