Unsupervised Discretization : An Analysis of Classification Approaches for Clinical Datasets

Discretization is a frequently used data preprocessing technique for enhancing the performance of data mining tasks in knowledge discovery from clinical data. It transforms real-world quantitative data into qualitative data. The aim of this study is to present an experimental analysis of the variation in performance of two simple unsupervised discretization methods with respect to different classification approaches. Equal width discretization and equal frequency discretization were applied to four benchmark clinical datasets obtained from the University of California, Irvine (UCI), machine learning repository. Both methods were used to transform quantitative attributes into qualitative attributes with three, five, seven and ten intervals. Six classification approaches were evaluated using four evaluation measures. The results of this experimental analysis show a clear variation in the performance of the classification algorithms: classification accuracy varies with both the discretization method used and the number of discretization intervals. Moreover, it can be inferred that different classification approaches require different discretization methods. No single method can be deemed best suited for all applications; hence the choice of an appropriate discretization method depends on data distribution, data interpretability, correlation, classification performance and domain of application.


INTRODUCTION
Data mining is one of the emerging research areas in computer science and information technology. It is the process of extracting patterns, useful information or trends from retrospective, massive and multidimensional data. Application areas of data mining techniques for knowledge extraction include business, academics and medicine. Clinical decisions on medical data are often made based on a doctor's perception and experience rather than on the knowledge hidden in the database. This can lead to bias, errors and excessive medical costs, which affect the quality of service provided to patients. Therefore, Knowledge Discovery in Databases (KDD) is commonly used to improve the quality of service. Integrating the KDD process with medical data can reduce medical errors, provide clinical decision support and improve the diagnostic process. Data mining is an important step in KDD and is used for various purposes in the medical domain, such as diagnosis, prognosis and decision support (Christopher et al., 2015; Jane et al., 2016; Nahato et al., 2015; Susmi et al., 2015; Sweetlin et al., 2016). KDD involves finding and interpreting knowledge from data and is described by the following steps: 1) understanding the domain, 2) dataset selection, 3) data cleaning and preprocessing, 4) data reduction and projection, 5) matching the objective to a data mining method (association rule mining, classification, clustering, regression, etc.), 6) choice of the algorithm for pattern searching, 7) searching for patterns of interest (data mining), 8) data interpretation and 9) use of the discovered knowledge (Fayyad et al., 1996). Most prior work on KDD focuses on step 7, the data mining step. Data mining applications often involve quantitative data; however, many learning algorithms are designed to handle qualitative data (Kohavi and Sahami, 1996). When algorithms deal directly with quantitative data, learning is less efficient and less effective (Richeldi and Rossotto, 1995). In many machine learning techniques, such quantitative data must therefore be transformed into qualitative data. This process is called data discretization. Data discretization refers to partitioning the data into a discrete set of intervals, where each interval is treated as a category.
Data discretization simplifies the original data and improves the efficiency of prediction. It has several advantages in machine learning and data mining tasks. In particular, it increases the understandability of classification models that use rule sets (Liu et al., 2002; Fu, 2011). It also reduces the computation time needed for processing continuous data by dividing the data into a reduced set of intervals (Mittal and Cheong, 2002). Maslove et al. (2013) evaluated six discretization methods: two supervised methods (minimum description length-based and ChiMerge), three unsupervised methods (equal width, equal frequency and K-means) and one method specific to clinical data with both supervised and unsupervised components (reference-range based). They examined the impact of discretization on three evaluation parameters: accuracy, consistency and simplicity. To evaluate the six discretization methods for accuracy, each method was examined with decision tree and naïve Bayes classification approaches. They evaluated the discretization methods for consistency by deriving the inconsistency count for each discretization experiment, and for simplicity by counting the number of nodes in each decision tree generated by each discretization method. For this evaluation, they used both laboratory data and physiologic data derived from adult patients in the intensive care unit. From the results, they observed that supervised methods were more accurate than unsupervised ones; among the unsupervised methods, equal frequency and K-means performed well. Yang and Webb (2009) showed that discretization is an effective technique for probability-based learning. Their study inferred that the effectiveness of discretization in naïve Bayes learning has an impact on the performance of naïve Bayes classifiers. They used classification error as the performance measure for the naïve Bayes classifier. In order to minimize
the classification error, they analyzed two factors with respect to discretization: 1) decision boundaries and 2) the error tolerance of probability estimation for each quantitative attribute. From the analysis they concluded that discretization involving these factors can affect the classification bias and variance of the classifiers; these effects are named discretization bias and discretization variance. To manage discretization bias and variance, they used the concepts of interval frequency and interval number. Moreover, they proposed two efficient unsupervised discretization methods, proportional discretization and fixed frequency discretization, for managing discretization bias and variance. They evaluated these two methods against four discretization methods for the naïve Bayes classifier on 29 benchmark datasets from the UCI machine learning repository. The results demonstrated that the proposed discretization methods reduce naïve Bayes classification error when compared to established discretization methods.
This study focuses on two unsupervised discretization techniques: equal width discretization and equal frequency discretization. Continuous-valued attributes are discretized into several intervals and the classification performance of six classification approaches is analyzed. The observations and findings of the experimental analysis can serve as guiding principles for the preprocessing of clinical data.

MATERIALS AND METHODS
The clinical datasets used in this experimental study were selected from the University of California, Irvine (UCI) Machine Learning Repository. Datasets containing categorical, discrete and continuous data were chosen. The list of datasets is presented in Table 1. Descriptions of the Cleveland Heart Disease (CHD), Chronic Kidney Disease (CKD), Pima Indians Diabetes (PID) and BUPA Liver Disorder (BLD) datasets are presented in Tables 2 to 5, respectively. In particular, the PID dataset contains the details of 768 Pima Indian women.
The continuous-valued attributes in these datasets were discretized using Equal Width (EW) discretization and Equal Frequency (EF) discretization. The former divides a continuous-valued feature f into k intervals of equal width, where k is a user-defined parameter. Each interval thus has width w = (max - min) / k, and the interval boundaries are min+w, min+2w, ..., min+(k-1)w. The latter divides the range of the continuous-valued feature into k equally sized bins, each containing approximately the same number of instances; with k again a user-defined parameter, each interval contains roughly n/k values, where n is the total number of instances (records) in the dataset. The discretized data is split into training and testing data: the former is used for obtaining the classifier using an induction algorithm and the latter is used for evaluating the performance of the classifier using performance evaluation measures. Cross-Validation (CV) with k folds is a technique whereby the dataset D is randomly split into k folds of approximately equal size. The classifier (model) is trained and tested k times; each time, k-1 folds are used for training and the remaining fold is used for testing. In classification, k-fold cross-validation is regarded as the best method for validating and selecting a classifier (Kohavi, 1995). The associative classifier (CBA), decision tree classifier (C4.5), Support Vector Machine (SVM), Multi-Layer Perceptron classifier (MLP), naïve Bayes classifier (NB) and k-Nearest Neighbour classifier (kNN) are validated (Han and Kamber, 2006).
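As a concrete illustration, the two binning rules described above can be sketched in Python using NumPy. The function names here are illustrative only, not part of any established library; equal width binning uses the boundaries min+w, ..., min+(k-1)w, while equal frequency binning places the boundaries at the data quantiles so that each bin holds roughly n/k instances.

```python
import numpy as np

def equal_width_bins(values, k):
    """Discretize into k intervals of equal width w = (max - min) / k."""
    lo, hi = values.min(), values.max()
    w = (hi - lo) / k
    # Interval boundaries at min+w, min+2w, ..., min+(k-1)w
    edges = [lo + i * w for i in range(1, k)]
    return np.digitize(values, edges)

def equal_frequency_bins(values, k):
    """Discretize into k bins, each holding roughly n/k instances."""
    # Interval boundaries are the (i/k)-quantiles of the data
    edges = np.quantile(values, [i / k for i in range(1, k)])
    return np.digitize(values, edges)

vals = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
ew = equal_width_bins(vals, 3)        # the outlier dominates the bin widths
ef = equal_frequency_bins(vals, 3)    # bins are balanced by instance count
```

Note the contrast on skewed data: a single outlier stretches the equal-width intervals so that most instances fall into one bin, whereas equal-frequency binning keeps the bin populations balanced.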
In this experimental study, six well-known classification approaches were used. Each approach differs from the others in two aspects: first, the induction (learning) algorithm used for training the classifier; and second, the knowledge-representation form used to represent the classification model. The six classification approaches are as follows. First, a decision tree classifier (Quinlan, 1986), induced (trained) using the C4.5 algorithm; the classifier (knowledge model) is represented in the form of a tree. Second, the naïve Bayes classifier, which uses a probabilistic induction approach; the knowledge model is represented in the form of probabilistic values. Third, the associative classifier CBA (Liu et al., 1998), which uses an Apriori-based (Agrawal and Srikant, 1994) classification rule induction approach; the knowledge model is represented in the form of IF-THEN associative classification rules. Fourth, the Multilayer Perceptron (MLP) (Rosenblatt, 1958), induced using a gradient descent-based backpropagation algorithm; the knowledge is represented by a trained feed-forward neural network. Fifth, the Support Vector Machine (Boser et al., 1992), induced using the Sequential Minimal Optimization (SMO) algorithm; the knowledge model is represented in the form of support vectors and separating hyperplanes. Sixth, the kNN classifier, trained using a distance-based approach; the classifier is represented in terms of distance measures from neighbouring instances. The choice of a classification approach and an appropriate classifier depends on the need and purpose of the classifier in the domain of application. Moreover, factors such as data distribution and the entropy of discretization may also be considered.
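The k-fold cross-validation protocol used to validate these classifiers can be sketched as follows. This is a minimal pure-NumPy illustration, pairing the fold-splitting scheme with a simple 1-nearest-neighbour classifier as a stand-in for the kNN approach; the function names are our own and not from any library.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly split n instance indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def one_nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbour prediction using Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)

def cross_validated_accuracy(X, y, k=5):
    """Train on k-1 folds, test on the held-out fold, average over folds."""
    folds = kfold_indices(len(X), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = one_nn_predict(X[train], y[train], X[test])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))
```

Each instance appears in exactly one test fold, so every record contributes once to the averaged accuracy estimate.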
In this experimental study, four performance evaluation measures were used. The four measures, namely sensitivity, specificity, F-measure and accuracy, differ in their evaluation focus. Sensitivity evaluates the effectiveness of a classifier in identifying positive labels, whereas specificity evaluates how effectively a classifier identifies negative labels. F-measure relates the data's positive labels to those assigned by the classifier based on a per-class average, and accuracy evaluates the overall classification efficiency of the classifier.
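All four measures follow directly from the confusion-matrix counts (true/false positives and negatives). A minimal sketch, assuming a binary task with the positive class encoded as 1 and at least one instance behind each denominator:

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, F-measure and accuracy from TP/TN/FP/FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)      # recall on the positive labels
    specificity = tn / (tn + fp)      # recall on the negative labels
    precision = tp / (tp + fp)
    # F-measure is the harmonic mean of precision and sensitivity
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, f_measure, accuracy
```

A classifier can score high on accuracy yet poorly on sensitivity when the positive class is rare, which is why the four measures are reported separately.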

RESULTS AND DISCUSSION
The classification performance of the six classification approaches under equal width discretization and equal frequency discretization is presented in Table 6. A discussion of the observations, findings and important inferences is presented below.
For the PID dataset, the naïve Bayes classifier achieves the highest accuracy of 76.307% for EW discretization with 7 intervals, whereas the same classifier with 7 intervals of EF discretization yields 73.96%. The highest accuracy for EF discretization on the PID dataset is achieved by the C4.5 algorithm (74.867%). Although the entropy of the partitions (intervals) is proportional to the number of partitions, a drop in classification accuracy with an increasing number of partitions can be observed. This drop in accuracy is due to the inter-correlation within the attribute subset and the correlation between each attribute and the class attribute: a decrease in the former and an increase in the latter are preferred.
A change in the attribute selection order, or in the attribute subset chosen for the construction of a decision tree, may result in a variation in classification performance. For example, the highest classification accuracy for EF discretization on the BLD dataset was achieved by the C4.5 classifier trained with 3 intervals. Moreover, increasing the number of intervals enhanced the information gain of the individual attributes. However, during tree construction, the attribute subsets at the lower levels of the tree yield different combinations of attributes, and different combinations differ in their level of inter-correlation. Hence, a fall in accuracy for EF discretization with 10 intervals can be observed.

Table 1: List of clinical datasets used in this study.