Fuzzy Discretization based Classification of Medical Data

Discretization is one of the commonly used data preprocessing techniques for improving the efficiency of the knowledge extraction process on clinical data. Generally, clinical data contain numeric attributes with continuous values. Data discretization simplifies the original data by transforming continuous attribute values into a finite set of intervals. Although discretization can handle continuous attributes in clinical data, it is not always appropriate: attribute values may be vague or imprecise and may follow multiple distributions across different classes, which challenges the mining of clinical data. Hence, there is a need for fuzzy discretization to pre-process clinical data before mining. The aim of this study is to derive fuzzy discretization from crisp-interval discretization using a geometric approach for constructing fuzzy sets, where the overlapping region between fuzzy sets is represented as a geometric area. The study comprises three steps. First, non-overlapping fuzzy sets are constructed using the intervals generated by crisp-interval discretization. Second, the area of overlap between the fuzzy sets is computed using the geometric approach and an average area of overlap is estimated. Third, the fuzzy sets are redesigned based on the estimated average area of overlap. Fuzzy discretizations with three, five and seven intervals have been examined using the Pima Indian Diabetes (PID) dataset and the Bupa Liver Disorder (BLD) dataset taken from the University of California Irvine machine learning repository. The variation in performance of the crisp and fuzzy discretization methods is measured using six classification approaches, namely a tree-based approach, a probabilistic induction based approach, a rule-based approach, a network learning approach, a kernel-based approach and a distance-based approach, together with a rule-based fuzzy inference system. The results show that the classification accuracy remains stable with less deviation across different classifiers with varying intervals.


INTRODUCTION
Electronic Medical Records (EMR) store enormous amounts of clinical data describing the health care examinations undergone by patients. These data contain hidden knowledge that can be used in developing Clinical Decision Support Systems (CDSSs). A CDSS assists the clinician in decision making activities such as diagnosis, prognosis and treatment. The process of Knowledge Discovery in Databases (KDD) plays a vital role in extracting knowledge from clinical data. Data preprocessing and data mining are important steps in KDD. Data preprocessing is the task of improving the quality of data for the mining process. The task includes data cleaning, data integration, data transformation and data reduction. Data reduction methods comprise numerosity reduction, dimensionality reduction and data discretization. Data discretization plays an important role in obtaining cognitively relevant human interpretation and in speeding up computation with a reduced level of data (Mittal and Cheong, 2002; Russell and Norvig, 1995).
Data discretization converts continuous valued attributes into a range of discrete intervals. This conversion can also affect the performance of predictors and classifiers (Kianmehr et al., 2008; Ishibuchi et al., 2001). Several works in the literature emphasize the importance of performing data discretization before the mining process (Maslove et al., 2013; Mittal and Cheong, 2002). Based on the learning approach, data discretization methods are classified into two types, namely supervised and unsupervised (Dougherty et al., 1995). Supervised discretization is possible only when class information is present in the database. If no class information is available, unsupervised discretization is preferred. Shanmugapriya et al. (2016b) used unsupervised interval discretization methods in their study and applied them to four medical data sets, namely the Cleveland Heart Disease (CHD) data set, the Chronic Kidney Disease (CKD) data set, the Pima Indians Diabetes (PID) data set and the BUPA Liver Disorder (BLD) data set. The performance of various classifiers such as the C4.5 decision tree, Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Classification Based on Association rules (CBA), Bayes and Multilayer Perceptron (MLP) was analyzed by varying the intervals. Normally, interval discretization has been used for handling continuous attributes in many machine learning techniques such as decision trees, Bayesian networks and association rule mining (Quinlan, 1996).
Even though interval discretization is sufficient to handle continuous attributes, it may not be appropriate in situations where there is no clear boundary at which to set the interval limits. Since fuzzy logic is a convenient and well known tool for handling continuous attributes with unclear boundaries (Zeinalkhani and Eftekhari, 2014; Zimmermann, 1996), in this study fuzzy based discretization has been used to discretize data with continuous attributes. Moreover, in a Clinical Decision Support System (CDSS), there is a need for human reasoning in order to handle continuous attributes (Pal et al., 2012). Fuzzy logic can better represent continuous attributes in a human understandable manner. To handle vague and imprecise continuous attributes in the data set, fuzzy discretization is performed on the dataset (Mehta et al., 2009). There are several works on deriving fuzzy discretization from interval discretization (Roy and Pal, 2003; Zeinalkhani and Eftekhari, 2014).

Ishibuchi and Yamamoto (2003) examined two methods of generating fuzzy discretization from interval based discretization. The first method was based on the trapezoidal membership function, which is a linear function. The second method extended the trapezoidal membership function to a piecewise linear function. In both methods, the overlap grades were assigned based on the parameters of adjacent membership functions. For experimentation, they used three interval discretization methods, namely equal-width intervals, equal-frequency intervals and minimum entropy intervals. Their approach was tested using the wine data set with 13 continuous valued attributes and the sonar dataset with 60 continuous valued attributes by varying the overlap grades. The datasets were collected from the University of California Irvine (UCI) Machine Learning repository. The discretized data sets were classified using a fuzzy rule-based classifier. From the results, it was observed that classification ability increased in some cases and degraded in other cases as the overlap grades increased.

Zeinalkhani and Eftekhari (2014) proposed a two-step algorithm to generate the membership functions. In the first step, a discretization algorithm divides the domain of each attribute into several partitions and in the second step, a membership function is defined on each partition. The generated partitions were transformed into fuzzy membership functions. Transformations were performed based on four approaches: the first was based on partition width, the second on the standard deviation of examples inside the partition, the third on the Neighbor Partition Coverage Rate (NPCR) and the last on the Partition Coverage Rate (PCR). Furthermore, they proposed a membership function generation algorithm called Fuzzy Entropy Based Fuzzy Partitioning (FEBFP). For experimentation, they considered several discretization methods including equal width and equal frequency. Datasets were taken from the UCI machine learning repository and also from the Knowledge Extraction based on Evolutionary Learning (KEEL) dataset repository (Alcalá-Fdez et al., 2011). Their experimental results show that membership functions defined by partition coverage rate and partitions generated by the Zeta discretization algorithm performed well.

Ishibuchi et al. (2001) compared fuzzy discretization with standard non-fuzzy discretization using a fuzzy rule-based system. For the experimentation they used the wine data set taken from the UCI machine learning repository. The wine data set is a three-class pattern classification problem with 178 patterns and 13 continuous attributes. In fuzzy discretization, they discretized each attribute of the wine data set into fuzzy sets with linguistic terms, where each fuzzy set is characterized using a triangular membership function. They designed the fuzzy sets based on the following two conditions: 1) the sum of neighboring membership functions is always 1 and 2) the crossing points of neighboring membership functions coincide with the threshold values in the interval discretization. They generated the linguistic rules using linguistic terms. For the generation of linguistic rules, they extended the definition of basic rule selection criteria such as support and confidence. For non-fuzzy discretization, they used an entropy based discretization method. They compared the fuzzy and non-fuzzy discretization approaches using a fuzzy rule-based system on the wine dataset. From the results, it was observed that a higher classification accuracy (95%) was obtained using fuzzy discretization, whereas non-fuzzy discretization reached only 93% accuracy.

Fazzolari et al. (2014) presented a multi-objective evolutionary method to improve the accuracy-interpretability trade-off of fuzzy rule-based classification systems. This method works in three stages, namely fuzzy discretization, rule base extraction and concurrent tuning of the membership functions in the database and selection of rules in the rule base. In the first stage, a fuzzy discretization algorithm was designed to generate suitable granularities for defining the initial fuzzy partitions of the database. In the second stage, the rule base associated with the fuzzy partitions (obtained in the first stage) was created by extracting candidate fuzzy association rules. In the final stage, a multi-objective evolutionary algorithm was designed to perform the tuning of membership functions in the database concurrently with the selection of rules in the rule base. The method was tested with 35 datasets taken from the KEEL dataset repository, including small datasets as well as high dimensional and large scale datasets. The obtained results show that the multi-objective evolutionary approach improves the precision with respect to the results obtained using a single-objective approach.

In most of the existing works, the knowledge of a domain expert is utilized to design the fuzzy sets. To overcome this dependency on an expert, in this study a geometric approach is used for designing the fuzzy sets. A geometric representation is preferred as human reasoning can be better represented.
This study presents a geometric approach using SimE for deriving fuzzy discretization from the Equal Width (EW) interval discretization method (Shanmugapriya et al., 2016a). The proposed approach was tested with two medical datasets, namely the Pima Indians Diabetes dataset and the Bupa Liver Disorder dataset, with three, five and seven intervals. Fuzzy discretization is derived from interval discretization (EW) in three steps. In the first step, the data sets are discretized into intervals using the equal width discretization method. In the second step, fuzzy sets are created using the boundary values of each interval derived from the equal width discretization method. The adjacent fuzzy sets have no similarity (overlapping area) between them because they are derived from crisp intervals. Setnes et al. (1998) have suggested that fuzzy sets with an optimal overlapping area can improve the semantic representation and performance of any fuzzy based system. Since some overlap between adjacent fuzzy sets is preferred, the area of overlap is estimated by investigating several studies (Allahverdi, 2009; Muthukaruppan and Er, 2012; Samuel et al., 2013) on fuzzy classification of medical data.
In the third step, the fuzzy sets created in step two are redesigned with the estimated area of overlap using SimE. The proposed approach is evaluated through fuzzy rule-based classification for the considered intervals on the Pima Indian Diabetes dataset and the Bupa Liver Disorder dataset.

MATERIALS AND METHODS
Two clinical datasets taken from the University of California Irvine machine learning repository have been used for this experimentation: the Pima Indian Diabetes (PID) dataset with 9 attributes and 768 instances and the Bupa Liver Disorder (BLD) dataset with 7 attributes and 345 instances. Each dataset contains discrete, categorical and continuous attributes. Table 1 shows the description, type and range of the attributes in the PID dataset. It has six continuous attributes, one discrete attribute and one categorical attribute representing the class label (presence or absence of diabetes). Table 2 depicts the description, type and range of the attributes in the BLD dataset. It includes five continuous attributes and one categorical attribute representing the class label (presence or absence of the liver disorder).

Fuzzy set: A fuzzy set is a set whose elements have degrees of membership. A fuzzy set $A$ in $X$ is a set of ordered pairs (Zadeh, 1965):

$$A = \{(x, \mu_A(x)) \mid x \in X\} \tag{1}$$

where $X$ is called the universe of discourse and $\mu_A(x): X \to [0,1]$ is the membership function, which maps each element $x$ of $X$ to a value between 0 and 1. Generally, membership functions are identified and designed by a domain expert (Bera et al., 2014). In this study, the triangular membership function is used for characterizing the fuzzy sets. The triangular membership function for a fuzzy set $X$ is defined using Eq. (2) (Kaufmann, 1975; Klir and Yuan, 1991):

$$\mu_X(x) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & a < x \le b \\ \dfrac{c-x}{c-b}, & b < x < c \\ 0, & x \ge c \end{cases} \tag{2}$$

where $x$ is an element of the fuzzy set $X$, $a$ and $c$ represent the lower and the upper boundary of the fuzzy set $X$ and $b$ represents the center of the fuzzy set $X$.
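As an illustration of Eq. (2), the following minimal Python sketch evaluates a triangular membership function. The function name `triangular_membership` is hypothetical, and the sketch assumes a < b < c; the parameter values in the usage lines (boundaries 93 and 103, center 98) are chosen only for illustration.

```python
def triangular_membership(x, a, b, c):
    """Degree of membership of x in a triangular fuzzy set.

    a and c are the lower and upper boundaries, b is the center (peak).
    Assumes a < b < c.
    """
    if x <= a or x >= c:
        return 0.0                      # outside the support of the set
    if x <= b:
        return (x - a) / (b - a)        # rising edge
    return (c - x) / (c - b)            # falling edge


# Illustrative set with boundaries 93 and 103 and center 98
peak = triangular_membership(98, 93, 98, 103)      # at the center
half = triangular_membership(95.5, 93, 98, 103)    # halfway up the rising edge
outside = triangular_membership(110, 93, 98, 103)  # beyond the upper boundary
```

The membership degree is 1 at the center, decreases linearly towards both boundaries and is 0 outside them.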
Fuzzy set similarity: Similarity is a measure of approximate equality between fuzzy sets. Similarity measures of fuzzy sets have been applied in many fields such as classification, clustering, image processing, fuzzy reasoning and decision making (Setnes et al., 1998; Zwick et al., 1987). Different kinds of similarity measures have been proposed in the literature (Pappis and Karacapilidis, 1993). In most of the existing works, researchers have estimated the similarity based on the elements of the sets. In their previous work, Shanmugapriya et al. (2016a) proposed an algorithm called Similarity Estimator (SimE) for estimating the similarity between fuzzy sets using a geometric approach.

SYSTEM FRAMEWORK
The proposed method has four phases, namely Interval Discretization, Fuzzy Discretization, Fuzzy Rule-based Classification and Performance Analysis. The framework of the proposed method is given in Fig. 1. The details of each phase are explained in the following subsections.

Interval discretization: In the first phase, the continuous data of the PID and BLD datasets are discretized into $k$ equal sized intervals ($I_1, I_2, \dots, I_k$) using the EW discretization method. Each interval $I_j$ is denoted by its lower limit $l_j$ and upper limit $u_j$ as $I_j = [l_j, u_j]$. The width of an interval ($w$) can be computed using Eq. (3), (4) and (5) (Liu et al., 2002):

$$w = \frac{V_{max} - V_{min}}{k} \tag{3}$$

$$V_{max} = \max\{v_1, v_2, \dots, v_n\} \tag{4}$$

$$V_{min} = \min\{v_1, v_2, \dots, v_n\} \tag{5}$$

where $V_{max}$ and $V_{min}$ are the maximum and minimum values of an attribute $V$, $v_i \in V$, $i = \{1, 2, 3, \dots, n\}$; $n$ is the number of values in each attribute; $k$ is the number of intervals specified by the user. In this study, three values of $k$ have been examined: $k = \{3, 5, 7\}$. The $k - 1$ interior cut points are $V_{min} + w$, $V_{min} + 2w$, ..., $V_{min} + (k - 1)w$; together with $V_{min}$ and $V_{max}$ they give $k + 1$ interval boundaries. The intervals obtained in this phase do not overlap.
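The equal width procedure described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the study; the function names `equal_width_intervals` and `interval_index` and the sample values are hypothetical.

```python
def equal_width_intervals(values, k):
    """Split the range of `values` into k equal width intervals (l_j, u_j)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("attribute is constant; interval width would be zero")
    w = (v_max - v_min) / k                       # interval width
    return [(v_min + j * w, v_min + (j + 1) * w) for j in range(k)]


def interval_index(v, intervals):
    """Return the 0-based index of the interval that contains v."""
    v_min = intervals[0][0]
    w = intervals[0][1] - intervals[0][0]
    j = int((v - v_min) / w)
    return min(max(j, 0), len(intervals) - 1)     # the maximum falls in the last interval


# Hypothetical attribute values discretized into k = 3 intervals
sample = [0, 2, 4, 9]
intervals = equal_width_intervals(sample, 3)
codes = [interval_index(v, intervals) for v in sample]
```

For the sample values, the width is 3, the interior cut points are 3 and 6, and the values 0, 2, 4 and 9 fall into the first, first, second and third interval respectively.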

Fuzzy discretization:
In this phase, fuzzy discretization is obtained from the interval discretization in three steps:

Step 1: In this step, fuzzy sets ($A_1, A_2, \dots, A_k$) are constructed for each attribute of the PID and BLD datasets. Triangular membership function (Kaufmann, 1975) [93,103,98].
Step 2: The fuzzy sets generated in step 1 have no overlapping area because they are derived from crisp intervals. In this step, to design fuzzy sets with an overlapping area, an average area of overlap between the fuzzy sets is estimated. This estimate was arrived at after investigating several studies (Allahverdi, 2009; Muthukaruppan and Er, 2012; Samuel et al., 2013) on fuzzy classification of medical data.
Step 3: In this step, the fuzzy sets obtained in step 1 are redesigned with the estimated area of overlap using the SimE algorithm. SimE computes the area of overlap by partitioning the region of overlap into geometric structures and summing the areas of the resulting polygons. To obtain an area of overlap equal to the estimated value, the fuzzy sets are redesigned by adjusting the parameters of the triangular membership function (a, b and c). This step results in overlapping fuzzy sets.
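The idea of measuring overlap as a geometric area can be illustrated with the short Python sketch below. It is not the SimE algorithm itself, whose details are given in Shanmugapriya et al. (2016a); it is one simple way, under stated assumptions, to partition the region under min(μ1, μ2) of two triangular fuzzy sets into trapezoids and sum their areas. Each set is passed as a hypothetical tuple (a, b, c) of lower boundary, center and upper boundary with a < b < c, and all names are illustrative.

```python
def tri(x, a, b, c):
    """Triangular membership: a, c are the boundaries and b is the center."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def _edges(a, b, c):
    """Sloped edges of a triangle as (slope, intercept, x_start, x_end)."""
    rise = 1.0 / (b - a)
    fall = -1.0 / (c - b)
    return [(rise, -a * rise, a, b), (fall, -c * fall, b, c)]


def overlap_area(set1, set2):
    """Area under min(mu1, mu2) for two triangular fuzzy sets (a, b, c)."""
    # Breakpoints: the corners of both triangles plus edge crossing points.
    xs = set(set1) | set(set2)
    for m1, q1, lo1, hi1 in _edges(*set1):
        for m2, q2, lo2, hi2 in _edges(*set2):
            if m1 != m2:
                x = (q2 - q1) / (m1 - m2)
                if max(lo1, lo2) <= x <= min(hi1, hi2):
                    xs.add(x)
    xs = sorted(xs)
    # Between consecutive breakpoints min(mu1, mu2) is linear, so each
    # slice is a trapezoid (or triangle) whose area is exact.
    area = 0.0
    for left, right in zip(xs, xs[1:]):
        h_left = min(tri(left, *set1), tri(left, *set2))
        h_right = min(tri(right, *set1), tri(right, *set2))
        area += 0.5 * (h_left + h_right) * (right - left)
    return area


# Hypothetical adjacent sets: crisp-derived (no overlap) and widened (overlap)
no_overlap = overlap_area((0, 1, 2), (2, 3, 4))
with_overlap = overlap_area((0, 2, 4), (2, 4, 6))
```

In the second call the falling edge of the first set and the rising edge of the second set cross at x = 3 with membership 0.5, so the overlap is a triangle with base 2 and height 0.5. Adjusting a and c of the adjacent sets changes this area, which is the quantity that the redesign in step 3 aims to match to the estimated value.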
Figure 3 shows the overlapping fuzzy sets of mean corpuscular volume (mcv) attribute of BLD data set obtained after redesigning.The fuzzy set 'Small' is defined with the values [0, 82, 39], the fuzzy set

Fuzzy rule-based classification:
In this phase, a Mamdani-type Fuzzy Inference System (FIS) is used as the fuzzy rule-based classification system for classifying the presence or absence of a disease. The fuzzy inference system is modeled in the following four steps:

Step 1: Fuzzification: This process involves the transformation of all the input attributes into the corresponding fuzzy sets with linguistic terms using the function defined in Eq. (2). The inputs of the fuzzy inference system are the generated fuzzy sets, the values of the attributes and the rule set. In this study, the proposed geometric approach for fuzzy discretization is used to generate the fuzzy sets (fuzzified output).
Step 2: Fuzzy rule set generation: Rule set is created by including all possible combinations of attributes and classes.The rule set is characterized by a set of IF-THEN rules in which the antecedents and the consequents involve linguistic terms.In this study, fuzzy rules are generated by defining the crisp rules with linguistic terms.Discretized data obtained from the interval discretization are given to the Partial Decision Tree (PART) algorithm for generating crisp rules (Exarchos et al., 2012).
Step 3: Fuzzy inference: This process maps a given fuzzy input to a fuzzy output using the rules contained in the rule set. This mapping provides a basis from which decisions can be made. The inference process receives its inputs from the fuzzification process and the rule set. This is obtained by performing the following steps as discussed by Rajasekaran and Vijayalakshmi Pai (2007):
Step 3.1: Apply fuzzy AND operator on the antecedent part of the rule.
Step 3.2: Analyze the implication from antecedent to consequent, using the rules in the rule set.
Step 3.3: Aggregate the consequents across the rules into single output.
Step 4: Defuzzification: The fuzzy inference process returns the inference value of an instance. In this step, the inference value is mapped into a crisp output using the Mean of Maximum (MoM) defuzzification method (Naaz et al., 2011). The accuracy of fuzzy rule-based classification is computed based on the defuzzified value.
Steps 1 through 4 are repeated for each interval discretization (3, 5 and 7).
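The four steps above can be sketched as a small Mamdani-style inference routine in Python. This is a simplified illustration and not the MATLAB implementation used in the study: it assumes min for the fuzzy AND and for implication, max for aggregation, and Mean of Maximum defuzzification over a sampled output universe. The attribute names, linguistic terms, fuzzy set parameters, rules and the 0.5 decision threshold are all hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership: a, c are the boundaries and b is the center."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def mamdani_mom(inputs, input_sets, rules, output_sets, universe):
    """Return the Mean of Maximum output, or None if no rule fires."""
    aggregated = [0.0] * len(universe)
    for antecedent, out_term in rules:
        # Steps 1 and 3.1: fuzzify each input and combine with fuzzy AND (min)
        strength = min(
            tri(inputs[attr], *input_sets[attr][term])
            for attr, term in antecedent.items()
        )
        for i, y in enumerate(universe):
            # Step 3.2: implication clips the consequent at the rule strength
            clipped = min(strength, tri(y, *output_sets[out_term]))
            # Step 3.3: aggregate the consequents across rules with max
            aggregated[i] = max(aggregated[i], clipped)
    peak = max(aggregated)
    if peak == 0.0:
        return None
    # Step 4: Mean of Maximum over the points with the highest membership
    maxima = [y for y, m in zip(universe, aggregated) if m == peak]
    return sum(maxima) / len(maxima)


# Hypothetical fuzzy sets (a, b, c), rules and output universe
input_sets = {
    "glucose": {"low": (0, 50, 110), "high": (90, 150, 200)},
    "bmi": {"low": (10, 20, 30), "high": (25, 40, 55)},
}
output_sets = {"absent": (0.0, 0.25, 0.5), "present": (0.5, 0.75, 1.0)}
rules = [
    ({"glucose": "low", "bmi": "low"}, "absent"),
    ({"glucose": "high", "bmi": "high"}, "present"),
]
universe = [i / 100 for i in range(101)]

score = mamdani_mom({"glucose": 140, "bmi": 35},
                    input_sets, rules, output_sets, universe)
label = "present" if score is not None and score > 0.5 else "absent"
```

For this hypothetical instance only the second rule fires, the clipped "present" set has a plateau centered at 0.75, and the instance is therefore labeled "present". In practice the rules would come from the crisp rules generated in step 2 rather than being written by hand.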

Performance analysis:
The performance of the EW interval discretization method and the fuzzy discretization method is analyzed and compared using six traditional classifiers, namely the Associative classifier (CBA), Decision tree classifier (C4.5), Support Vector Machine (SVM), Multi-layer Perceptron classifier (MLP), Naïve Bayes classifier (NB) and k-Nearest Neighbour classifier (kNN), and a rule-based Fuzzy Inference System (FIS), by varying the number of discretization intervals (three, five and seven). The performance evaluation parameters, namely Classification Accuracy ($CA$), Sensitivity ($SN$) and Specificity ($SP$), are computed using Eq. (6), Eq. (7) and Eq. (8) respectively:

$$CA = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

$$SN = \frac{TP}{TP + FN} \tag{7}$$

$$SP = \frac{TN}{TN + FP} \tag{8}$$

where $TP$, $TN$, $FP$ and $FN$ represent the true positives, true negatives, false positives and false negatives respectively. Sensitivity measures the proportion of positives that are correctly identified as positive. Specificity measures the proportion of negatives that are correctly identified as negative. Accuracy measures the proportion of the total number of predictions that are correct.
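The three evaluation measures can be computed directly from the confusion matrix counts, as in the minimal Python sketch below. The function name `classification_metrics` and the counts in the usage line are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    ca = (tp + tn) / (tp + tn + fp + fn)   # proportion of correct predictions
    sn = tp / (tp + fn)                    # proportion of positives found
    sp = tn / (tn + fp)                    # proportion of negatives found
    return ca, sn, sp


# Hypothetical confusion matrix: 40 TP, 45 TN, 5 FP, 10 FN
ca, sn, sp = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these hypothetical counts the accuracy is 0.85, the sensitivity is 0.80 and the specificity is 0.90.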

RESULTS AND DISCUSSION
This study was implemented using MATLAB R2013a. The proposed approach was tested with two datasets, namely the Pima Indian Diabetes dataset and the Bupa Liver Disorder dataset. All the continuous attributes in each data set are fuzzified using the proposed fuzzy discretization approach. The performance of the fuzzy discretization approach is evaluated using a fuzzy rule-based classification system. The fuzzy toolbox available in MATLAB R2013a is used for building the fuzzy inference system. The data are split into training data (75% of the data) and testing data (25% of the data).
The fuzzy inference system is modeled using the training data and tested using the test data. For each dataset (PID and BLD) and for each discretization interval, the performance of the fuzzy rule-based classification system is analyzed and compared with six crisp-interval discretization based classifiers. Table 3 depicts the results obtained in the experimentation. For the Pima Indian diabetes dataset, the six traditional classifiers achieved an average accuracy of 71.63%, 70.618% and 70.40% for three, five and seven intervals respectively, as depicted in Fig. 4. For the same dataset, the fuzzy discretization based classifier obtained an accuracy of 55.46%, 64.58% and 52.99% for three, five and seven intervals respectively.
Fuzzy discretization based fuzzy classification thus obtained its highest accuracy of 64.58% at five intervals. For the Bupa Liver Disorder dataset, the traditional classifiers achieved an average accuracy of 57.66%, 55.94% and 55.99% for three, five and seven intervals respectively. For the same dataset, the fuzzy discretization based classifier obtained an accuracy of 53.04%, 51.01% and 48.11% for three, five and seven intervals respectively, as depicted in Fig. 5. There is a drop in the performance values as an expert is not involved in designing the fuzzy sets. The fuzzy discretization based classifier yielded its highest accuracy of 53.04% at three intervals.

Fig. 2: Non-overlapping fuzzy sets of mean corpuscular volume (mcv)

Fig. 4: Performance of classifiers for Pima Indian diabetes dataset

to handle data with vagueness and uncertainty where there are multiple overlapping data distributions. In order to handle such data, this study proposes a method for fuzzy discretization where each attribute is discretized into a set of overlapping fuzzy sets. The proposed fuzzy discretization method is examined using a fuzzy rule-based classification system. Then it is compared with six traditional classification approaches. The results obtained from this study show that the classification accuracy remains stable with less deviation across different classification approaches. However, the proposed fuzzy classifier provides better accuracy than the existing traditional classifiers in at least one interval. Further work in this direction can be the use of fuzzy logic in other classifiers to provide a hybrid classifier that can improve the accuracy further.

Table 1: Description of Pima Indian diabetes dataset

Table 3: Classification performance evaluation for fuzzy and crisp-interval discretization