AN IMPROVED C4.5 MODEL CLASSIFICATION ALGORITHM BASED ON TAYLOR’S SERIES

C4.5 is one of the most popular algorithms for rule-based classification. The algorithm offers many practical features, such as continuous-attribute handling, missing-value handling and over-fitting control. However, despite its promising advantage over the Iterative Dichotomiser 3 (ID3), C4.5 has the major setback of presenting the same result as ID3, especially when the same number of attributes is used. This paper proposes a technique that handles this setback. The performance of the proposed technique is measured in terms of accuracy. The entropy of information theory is measured to identify the central attribute of the dataset. The researchers apply exponential splitting information (EC4.5) in utilizing the central attribute of the same dataset. The result obtained on introducing the Taylor series is far better than that of C4.5 with the gain ratio.


INTRODUCTION
A decision tree, as the name implies, is a predictive model that can be viewed as a tree structure, where each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification [1]-[2]. It is a logical model, represented as a binary or multiclass tree, that shows how the value of a target variable can be predicted from the values of a set of predictor variables. Decision tree classifiers are considered "white box" classification models, as they can provide an explanation for their classifications and can be used directly for decision-making [3]. Many decision tree algorithms exist, including the Alternating Decision Tree (LAD), C4.5 (J48 pruned tree), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detection (CHAID) and QUEST. Decision tree algorithms such as C4.5 were developed early and continue to be regularly used in solving everyday classification tasks. However, despite its promising advantage over the ID3 algorithm, C4.5 has the major setback of presenting the same result as ID3, especially when the same number of attributes is used. In this paper, the predictive performance of the algorithm is enhanced by proposing a technique that handles this setback and presents a more promising result than C4.5 with the gain ratio. It is on this background that the exponential modification of the gain ratio is proposed.

RELATED WORK
The ID3 tree algorithm was introduced in 1986 by Quinlan Ross. It is based on Hunt's algorithm and is serially implemented. ID3 uses the information gain measure in choosing the splitting attribute [4]. The basic strategy in ID3 is the selection of the splitting attribute with the highest information gain, that is, the amount of information associated with an attribute value, related to its probability of occurrence [2]. Once the attribute has been chosen, the amount of information is measured; this measure is known as entropy [5]. Entropy measures the amount of uncertainty, surprise or randomness in a dataset, and it is zero when all the data in the set belong to a single class. One of the challenges with this approach is that ID3 tends to select the attribute with the greater number of values, which is not necessarily the best attribute [5]. When testing a small sample, the data may be over-fitted or over-classified, and only one attribute at a time is used for testing. Moreover, continuous data are difficult to analyze, as many trees need to be generated to find the right place to split the data, which makes the algorithm computationally expensive. The information measure used by ID3 is given by Equation (1).
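As a minimal sketch of the entropy and information-gain computation described above (the function names and data layout are our own, not part of the paper or of WEKA):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels; 0 for a pure set."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """ID3 criterion: reduction in entropy after splitting on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly attains the full entropy of the label set as its gain, while an uninformative attribute scores zero.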
On the other hand, the C4.5 algorithm is an extension of the ID3 algorithm. It has an enhanced method of tree pruning that reduces the misclassification errors caused by noise or excessive detail in the training dataset, problems found in ID3. It uses the gain-ratio impurity method to evaluate the splitting attribute [2], [6]. Quinlan Ross introduced split information into the information gain of ID3 to overcome the limitations of ID3, namely latency, over-fitting and the computational expense of handling continuous data. The gain ratio is given by Equation (2). The proposed modification is expected to:
i. increase the performance when the number of attributes differs;
ii. increase the performance when the number of attributes is the same; and
iii. decrease the percentage of uncertainty in the C4.5 algorithm.

METHODOLOGY
To overcome the limitations of C4.5, the researchers used Taylor's Series to modify the splitting information of C4.5.

Data Collection
The study uses an existing instructor's performance dataset from Abubakar Tafawa Balewa University, Bauchi, Nigeria. The data collected were cleaned, normalized and organized in a form suitable for the data mining process using the WEKA platform. Table 1 shows the data format used for the research. The data consist of both categorical and numerical attributes, making them suitable for this experiment.

The Existing Model (C4.5)
The C4.5 algorithm is an improvement of the ID3 algorithm, developed by Quinlan Ross in 1993. It uses the gain ratio as an extension of the information gain of ID3.

Mathematical Model (C4.5)
Let us consider a probability distribution $P = (p_1, p_2, p_3, \dots, p_m)$ over the $m$ classes of a dataset $D$. The information carried by the distribution, otherwise known as the entropy of $P$, proposed by [14]-[15], [18], is given as:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \quad (1)$$

The information gain for a test $A$ that splits $D$ into subsets $D_1, D_2, \dots, D_v$ is given by:

$$Gain(A) = Info(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j) \quad (2)$$

We can define the splitting information in the form:

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right) \quad (3)$$

The gain ratio of such a dataset is then given by:

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)} \quad (4)$$

The two limitations associated with ID3, i.e., latency and over-fitting error, are improved by the gain ratio. The algorithm of C4.5 is shown below.
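The split information and gain ratio defined above can be sketched as follows (a minimal illustration; the function names and the zero-split guard are our own assumptions):

```python
import math
from collections import Counter

def split_info(rows, attr_index):
    """Entropy of the partition sizes induced by splitting on one attribute."""
    n = len(rows)
    counts = Counter(row[attr_index] for row in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(gain, split):
    """C4.5 criterion: information gain normalized by the split information."""
    return gain / split if split > 0 else 0.0
```

A two-way split into equal halves has split information exactly 1, which is precisely the case where C4.5's criterion reduces to ID3's information gain.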

Algorithm of C4.5
Input: an attribute-valued dataset D

The gain ratio is known to present a better result than the information gain when the subsets induced by the split differ in size, but when the subsets are of equal size, the gain ratio and the information gain give the same result. From (4), writing $\beta$ for the split information, if $\beta = 1$:

$$GainRatio(A) = \frac{Gain(A)}{\beta} = Gain(A) \quad (7)$$

Equation (7) shows that when the split information $\beta = 1$, ID3 = C4.5.
To overcome this, making $\beta$ the subject, we can rewrite (7) as:

$$\beta = \frac{Gain(A)}{GainRatio(A)} \quad (8)$$

Now, from (7), for $\beta = 1$, consider the Taylor series of the exponential function:

$$e^{\beta} = \sum_{n=0}^{\infty} \frac{\beta^n}{n!} \quad (9)$$

Taking the limit at $\beta = 1$ gives

$$e = \sum_{n=0}^{\infty} \frac{1}{n!} \approx 2.718$$

Here $e$ is called the optimal split information; it optimizes the splitting information by moving its value away from 1, and it works in both cases, whether the subsets are equal or not. The new technique introduces a new parameter into the splitting information. We denote this E-SplitInfo, defined in the form:

$$E\text{-}SplitInfo_A(D) = e^{SplitInfo_A(D)} \quad (10)$$

This is equivalently defined as:

$$E\text{-}SplitInfo_A(D) = \sum_{n=0}^{\infty} \frac{SplitInfo_A(D)^n}{n!} \quad (11)$$

The introduction of the new parameter spreads the splitting values away from the value 1, which helps in obtaining a better result. Dividing the information gain of Equation (2) by Equation (10) leads to the new method EC4.5, which is defined as:

$$EGainRatio(A) = \frac{Gain(A)}{E\text{-}SplitInfo_A(D)} \quad (12)$$
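A small numerical sketch of the exponential split information: truncating the Taylor series of $e^{\beta}$ at 20 terms already matches $e$ to machine precision at $\beta = 1$ (the function names are our own):

```python
import math

def e_split_info(split, terms=20):
    """Exponential split information e**split, computed from the truncated
    Taylor series sum_{n=0}^{terms-1} split**n / n!."""
    return sum(split ** n / math.factorial(n) for n in range(terms))

def e_gain_ratio(gain, split):
    """EC4.5 criterion: information gain divided by e**split, so the
    denominator stays away from 1 even when the split information equals 1."""
    return gain / e_split_info(split)
```

When the split information is 1, the gain ratio collapses to the plain information gain, while the E-gain ratio divides by $e \approx 2.718$ instead, keeping the two criteria distinct.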

Algorithm of the Proposed EC4.5
Input: an attribute-valued dataset D

Evaluation
We consider the following terms in evaluating the performance of the proposed EC4.5.
(a) TN (True Negative): the number of correct predictions that an instance is invalid.
(b) FP (False Positive): the number of incorrect predictions that an instance is valid.
(c) FN (False Negative): the number of incorrect predictions that an instance is invalid.
(d) TP (True Positive): the number of correct predictions that an instance is valid.
Also, the following performance measures were used to test the performance of the proposed EC4.5.

Accuracy is the proportion of the total number of predictions that were correct:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision is the proportion of the predicted valid instances that were correct:

$$Precision = \frac{TP}{TP + FP}$$

Recall is the proportion of the valid instances that were correctly identified:

$$Recall = \frac{TP}{TP + FN}$$

F-Measure is derived from the precision and recall values:

$$F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

The F-Measure is used because, although precision and recall are valid metrics in their own right, one of them can be optimized at the expense of the other. The F-Measure is high only when precision and recall are both balanced and significant.
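The four measures can be computed directly from the confusion-matrix counts; a minimal sketch (the function name is our own):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```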
The classification target is "Should we play basketball?" The answer can be either yes or no. The weather attributes, which include outlook, temperature and humidity, take the following values: Outlook = {Sunny, Overcast, Rain}, Temperature = {Hot, Mild, Cool}, Humidity = {High, Normal}. Using these three value sets, the information gain (ID3), the gain ratio (C4.5) and the E-gain ratio (EC4.5) are calculated for the outlook based on temperature and humidity, as shown in the appendix.
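For illustration only, a hypothetical four-day sample (our own toy data, not the paper's table) shows how the three criteria score the same attribute differently:

```python
import math
from collections import Counter

def _entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def criteria(values, labels):
    """Return (gain, gain ratio, E-gain ratio) for a single attribute column."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    gain = _entropy(labels) - sum(len(p) / n * _entropy(p) for p in parts.values())
    split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return gain, gain / split, gain / math.exp(split)

# Hypothetical 4-day sample (illustrative only, not the paper's dataset):
outlook = ['Sunny', 'Sunny', 'Rain', 'Overcast']
play    = ['no',    'no',    'yes',  'yes']
```

On this sample the outlook attribute separates the classes perfectly, so the gain is 1.0, the split information is 1.5, the gain ratio is $1/1.5 \approx 0.667$, and the E-gain ratio is $e^{-1.5} \approx 0.223$.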

RESULTS AND DISCUSSION
In the experiment, the values of the gain ratio (C4.5) and the E-gain ratio (EC4.5) were first used to calculate the probability of uncertainty of selected attributes with the highest numbers of instances. The outcome is shown in detail in Table 3. Figure 2 shows that EC4.5 is the optimal algorithm, with the lowest probability of uncertainty on all attributes, while C4.5 has the highest probability of uncertainty. The detailed classification accuracies suggest that EC4.5 outperformed C4.5, with a lower FP rate of 0.003 and a TP rate of 0.994, which were used to calculate the accuracy using the performance metrics. Thus, EC4.5 is the optimal classification model in this paper. Figure 3 shows that EC4.5 has the highest accuracy of 99.40% with an error rate of 0.60%, while C4.5 has an accuracy of 51.27% with an error rate of 48.73%. Figure 4 shows the detailed results of the compared algorithms, with EC4.5 scoring higher than C4.5 on every measure: for precision, C4.5 has 0.513 and EC4.5 has 0.994; for recall, C4.5 has 0.513 and EC4.5 has 0.994; and for the F1 measure, C4.5 has 0.678 and EC4.5 has 0.994. The overall result suggests that EC4.5 is the optimal algorithm compared to C4.5.

CONCLUSIONS
This paper proposed a modified model (EC4.5). The proposed modification addresses the limitation of C4.5 of presenting a result equivalent to ID3 when the same number of attributes is used. After testing the two classifiers (C4.5 and EC4.5), the experiment shows that EC4.5 outperformed C4.5, with an accuracy of 99.40% versus 51.27%. Based on this result, EC4.5 was selected as the optimal algorithm. Future work should consider a hybrid approach to handle multi-dimensional data with large intervals using the EC4.5 algorithm.

Figure 1. Outcome of the three classification algorithms. From Figure 1, the three classification algorithms ID3, C4.5 and EC4.5 have the following outcomes: for outlook, with 5, 4, 5 attribute values, ID3 has a value of 0.247, C4.5 has 0.157 and EC4.5 has 0.112. Subsequently, for temperature, with 6, 4, 6 attribute values, ID3 has a value of 0.029, C4.5 has 0.019 and EC4.5 has 0.013. However, for humidity, which has the same number of attribute values (7, 7), ID3 and C4.5 have the same value of 0.152, while EC4.5 shows an improvement with a value of 0.092, reducing the uncertainty found in C4.5.

Figure 2. Probability of uncertainty outcome of the gain ratio and the E-gain ratio.
The shared induction steps of C4.5 and the proposed EC4.5 (the two algorithms differ only in the splitting criterion and the recursive call at step 12):

8.  a_best = best attribute according to the above-computed criterion
9.  Tree = create a decision node that tests a_best in the root
10. D_v = induced sub-datasets from D based on a_best
11. for all D_v do
12.     Tree_v = C4.5(D_v)   (in the proposed variant: Tree_v = EC4.5(D_v))
13.     attach Tree_v to the corresponding branch of Tree
14. end for
15. return Tree
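The recursive steps 8-15 can be sketched as follows; the splitting criterion is passed in as a function, so the same skeleton can serve C4.5 (gain ratio) or EC4.5 (E-gain ratio). This is a minimal sketch under our own naming assumptions, not the paper's exact implementation:

```python
from collections import Counter

def induce_tree(rows, labels, attrs, score):
    """Recursive tree induction (steps 8-15): pick the best attribute under
    the given scoring function, create a decision node, split D and recurse."""
    if len(set(labels)) == 1 or not attrs:           # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attrs, key=lambda a: score(rows, labels, a))   # step 8
    node = {'attr': best, 'branches': {}}                     # step 9
    parts = {}
    for row, label in zip(rows, labels):                      # step 10
        parts.setdefault(row[best], []).append((row, label))
    remaining = [a for a in attrs if a != best]
    for value, subset in parts.items():                       # steps 11-14
        sub_rows, sub_labels = zip(*subset)
        node['branches'][value] = induce_tree(list(sub_rows), list(sub_labels),
                                              remaining, score)
    return node                                               # step 15
```

Plugging in the gain ratio reproduces the C4.5 recursion; plugging in the E-gain ratio gives the EC4.5 variant with no other change to the skeleton.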

Table 3. Probability of uncertainty outcome of the gain ratio and the E-gain ratio.
Furthermore, the C4.5 and EC4.5 classification algorithms were trained and tested on the same dataset; the measures used for performance evaluation were accuracy, precision, recall and the F1 measure. Table 4 illustrates the detailed results of the two classification algorithms.