Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks

https://doi.org/10.1016/j.eswa.2013.08.089

Highlights

  • The hybrid decision tree is able to remove noisy data to avoid overfitting.

  • The hybrid Bayes classifier identifies a subset of attributes for classification.

  • Both algorithms are evaluated using 10 real benchmark datasets.

  • They outperform traditional classifiers in challenging multi-class applications.

Abstract

In this paper, we introduce two independent hybrid mining algorithms to improve the classification accuracy of decision tree (DT) and naïve Bayes (NB) classifiers on multi-class problems. Both DT and NB classifiers are useful, efficient and commonly used for solving classification problems in data mining. Since the presence of noisy contradictory instances in the training set may cause the generated decision tree to suffer from overfitting and reduced accuracy, our first proposed hybrid DT algorithm employs a NB classifier to remove these noisy troublesome instances from the training set before the DT induction. Moreover, it is extremely computationally expensive for a NB classifier to compute the class conditional probabilities for a dataset with high-dimensional attributes. Thus, in the second proposed hybrid NB classifier, we employ DT induction to select a comparatively more important subset of attributes for the computation of the naïve assumption of class conditional independence. We tested the performance of the two proposed hybrid algorithms against that of the existing DT and NB classifiers using classification accuracy, precision, sensitivity–specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed methods produce impressive results in the classification of real-life challenging multi-class problems. They are also able to automatically extract the most valuable training datasets and identify the most effective attributes for the description of instances from noisy complex training databases with large dimensions of attributes.

Introduction

During the past decade, a significant number of data mining algorithms have been proposed by computational intelligence researchers for solving real world classification and clustering problems (Farid et al., 2013, Liao et al., 2012, Ngai et al., 2009). Generally, classification is a data mining function that describes and distinguishes data classes or concepts. The goal of classification is to accurately predict class labels of instances whose attribute values are known, but whose class values are unknown. Clustering is the task of grouping a set of instances in such a way that instances within a cluster are highly similar to one another, but very dissimilar to instances in other clusters. It analyzes instances without consulting a known class label. The instances are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. The performance of data mining algorithms in most cases depends on dataset quality, since low-quality training data may lead to the construction of overfitting or fragile classifiers. Thus, data preprocessing techniques, in which the data are prepared for mining, are needed. Preprocessing can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the mining process. A number of data preprocessing techniques are available, such as (a) data cleaning: removal of noisy data, (b) data integration: merging data from multiple sources, (c) data transformation: normalization of data, and (d) data reduction: reducing the data size by aggregating and eliminating redundant features.

This paper presents two independent hybrid algorithms for scaling up the classification accuracy of decision tree (DT) and naïve Bayes (NB) classifiers in multi-class classification problems. DT is a classification tool commonly used in data mining; well-known DT induction algorithms include ID3 (Quinlan, 1986), ID4 (Utgoff, 1989), ID5 (Utgoff, 1988), C4.5 (Quinlan, 1993), C5.0 (Bujlow, Riaz, & Pedersen, 2012), and CART (Breiman, Friedman, Stone, & Olshen, 1984). The goal of DT is to create a model that predicts the value of a target class for an unseen test instance based on several input features (Loh and Shih, 1997, Safavian and Landgrebe, 1991, Turney, 1995). Amongst other data mining methods, DTs have various advantages: (a) simple to understand, (b) easy to implement, (c) requiring little prior knowledge, (d) able to handle both numerical and categorical data, (e) robust, and (f) able to deal with large and noisy datasets. A naïve Bayes (NB) classifier is a simple probabilistic classifier based on: (a) Bayes theorem, (b) strong (naïve) independence assumptions, and (c) independent feature models (Farid et al., 2011, Farid et al., 2010, Lee and Isa, 2010). It is also an important classifier for data mining, applied in many real world classification problems because of its high classification performance. Similar to DT, the NB classifier has several advantages, such as (a) easy to use, (b) only one scan of the training data required, (c) handling missing attribute values, and (d) handling continuous data.
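To make the NB classifier described above concrete, the following is a minimal sketch of a categorical naïve Bayes with Laplace smoothing. The function names and the toy data in the usage note are ours, not the paper's; the paper's own formulation is given later in its Section 3.

```python
from collections import Counter

def train_nb(X, y):
    """Count class priors and per-attribute value frequencies per class."""
    classes = Counter(y)
    cond = {c: [Counter() for _ in X[0]] for c in classes}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond[yi][j][v] += 1
    # number of distinct values per attribute, used for Laplace smoothing
    domain = [len({xi[j] for xi in X}) for j in range(len(X[0]))]
    return classes, cond, domain, len(y)

def predict_nb(model, x):
    """Pick the class maximising P(c) * prod_j P(x_j | c), i.e. the naive
    assumption of class conditional independence."""
    classes, cond, domain, n = model
    def score(c):
        p = classes[c] / n  # prior P(c)
        for j, v in enumerate(x):
            # Laplace-smoothed conditional P(x_j = v | c)
            p *= (cond[c][j][v] + 1) / (classes[c] + domain[j])
        return p
    return max(classes, key=score)
```

For example, trained on four instances with attributes (outlook, temperature), the model predicts the class whose prior times smoothed likelihoods is largest; one scan of the training data suffices, matching advantage (b) above.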

In this paper, we propose two hybrid algorithms, respectively for a DT classifier and a NB classifier, for multi-class classification tasks. The first proposed hybrid DT algorithm finds the troublesome instances in the training data using a NB classifier and removes these instances from the training set before constructing the learning tree for decision making. Otherwise, the DT may suffer from overfitting due to the presence of such noisy instances and its accuracy may decrease. Moreover, computing the class conditional probabilities with a NB classifier is extremely expensive for a dataset with many attributes. Our second proposed hybrid NB algorithm finds the most crucial subset of attributes using DT induction. The weights of the attributes selected by the DT are also calculated. Then only these most important attributes, with their corresponding weights, are employed in the calculation of the naïve assumption of class conditional independence. We evaluate the performance of the proposed hybrid algorithms against that of existing DT and NB classifiers using classification accuracy, precision, sensitivity–specificity analysis, and 10-fold cross validation on 10 real benchmark datasets from the UCI (University of California, Irvine) machine learning repository (Frank & Asuncion, 2010). The experimental results show that the proposed methods produce very promising results in the classification of real world challenging multi-class problems. These methods also allow us to automatically extract the most representative high quality training datasets and identify the most important attributes for the characterization of instances from a large amount of noisy training data with high dimensional attributes.
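The noise-removal step of the first hybrid algorithm can be sketched as follows. This is our own illustrative helper, not the paper's Algorithm 1: it accepts any classifier exposed as a `predict` callable (a NB classifier in the paper's setting) and drops the training instances that the classifier mislabels, treating them as noisy, before the DT is induced on the remainder.

```python
def filter_noisy(X, y, predict):
    """Keep only the training instances that the given classifier labels
    correctly; the rest are treated as noisy and removed before DT induction."""
    kept = [(xi, yi) for xi, yi in zip(X, y) if predict(xi) == yi]
    return [xi for xi, _ in kept], [yi for _, yi in kept]
```

Any instance whose label contradicts the classifier's prediction is discarded, so the subsequent tree is grown on a cleaner, more consistent training set.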

The rest of the paper is organized as follows. Section 2 gives an overview of the work related to DT and NB classifiers. Section 3 introduces the basic DT and NB classification techniques. Section 4 presents our two proposed hybrid algorithms for multi-class classification problems, based respectively on DT and NB classifiers. Section 5 provides experimental results and a comparison against existing DT and NB algorithms using 10 real benchmark datasets from the UCI machine learning repository. Finally, Section 6 concludes the findings and proposes directions for future work.


Related work

In this section, we review recent research on decision trees and naïve Bayes classifiers for various real world multi-class classification problems.

Supervised classification

Classification is one of the most popular data mining techniques that can be used for intelligent decision making. In this section, we discuss some basic techniques for data classification using decision tree and naïve Bayes classifiers. Table 1 summarizes the most commonly used symbols and terms throughout the paper.

The proposed hybrid learning algorithms

In this paper, we have proposed two independent hybrid algorithms, respectively for decision tree and naïve Bayes classifiers, to improve the classification accuracy in multi-class classification tasks. These algorithms are described in Section 4.1 (the proposed hybrid decision tree algorithm) and Section 4.2 (the proposed hybrid algorithm for a naïve Bayes classifier). Algorithm 1 describes the proposed hybrid DT induction, which employs a NB classifier to remove any noisy instances from the training set before the tree induction.
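The scoring step of the second hybrid algorithm, in which only the DT-selected attributes and their weights enter the naïve Bayes computation, can be sketched in log space as follows. The function name and arguments are illustrative assumptions on our part; the paper defines the weights via its own DT induction.

```python
import math

def weighted_nb_log_score(prior, likelihoods, weights):
    """Log-space score for an attribute-weighted naive Bayes:
    log P(c) + sum_j w_j * log P(x_j | c).
    Attributes the DT did not select get weight 0 and drop out entirely."""
    return math.log(prior) + sum(
        w * math.log(p) for w, p in zip(weights, likelihoods) if w > 0
    )
```

Working in log space avoids floating-point underflow when many conditionals are multiplied, and setting a weight to zero reproduces attribute selection: the term contributes nothing to the score.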

Experiments

In this section, we describe the test datasets and experimental environments, and present the evaluation results for both of the proposed hybrid decision tree and naïve Bayes classifiers.

Conclusions

In this paper, we have proposed two independent hybrid algorithms for DT and NB classifiers. The proposed methods improved the classification accuracy of both DT and NB classifiers in multi-class classification tasks. The first proposed hybrid DT algorithm used a NB classifier to remove the noisy troublesome instances from the training set before the DT induction, while the second proposed hybrid NB classifier used DT induction to select a subset of attributes for the computation of the naïve assumption of class conditional independence.

Acknowledgment

We appreciate the support for this research received from the European Union (EU) sponsored (Erasmus Mundus) cLINK (Centre of Excellence for Learning, Innovation, Networking and Knowledge) project (Grant No. 2645).

