IMPROVING CLASSIFICATION PERFORMANCE OF NEURO-FUZZY CLASSIFIER BY IMPUTING MISSING DATA

In medical data classification, improving classification performance is an important issue when the data set is small and contains multiple missing attribute values. A foremost objective of machine learning research is to improve the classification performance of classifiers, and this requires a sufficient number of training instances. In the proposed algorithm, we substitute missing attribute values with the available domain values of the attribute and generate additional training tuples that supplement the original training tuples. Together, the additional and original training samples provide sufficient data for learning. The neuro-fuzzy classifier is trained on this dataset. The classification performance of the neuro-fuzzy classifier on test data is obtained using the k-fold cross-validation method. The proposed method attains around 2.8% and 3.61% improvement in classification accuracy for this classifier.


INTRODUCTION
Data mining and machine learning methods are effectively applied to various medical data classification problems [1]. The most critical objective in data mining is to identify hidden patterns in data and to use the acquired knowledge to classify new cases [2]. Neural networks take their input from training data and adjust the weights that map input to output, which requires at least one tuple for each distinct case. Neural networks cannot learn situations that are not represented in the training tuples [3]. A sufficient number of training tuples is therefore needed to increase the classification ability of the classifier. The training dataset may contain few samples from its inception. Another reason for a small number of training tuples is that the training data may contain tuples with multiple missing attribute values, and such tuples need to be removed. In this case, extra training tuples are added to the original training data set to improve the classification ability of the classifier. The authors of [4] proposed an imputation method that produces additional training tuples, which are added to the original training samples to generate a new training data set. The classification performance of the classifier trained on this new data set is improved.
If a data instance has many attributes with missing values, the instance has to be deleted. The rule of thumb for instance deletion says that if a data set has more than 5% missing values, those tuples are retained. There are two basic methods for discarding data instances with missing values [6]: complete case analysis and dropping. When these methods are applied, it is assumed that the deleted cases are part of a large dataset and that the cases are missing completely at random. The deletion of tuples may introduce bias, and the resulting small sample size affects the analysis.
Several methods for handling missing values are available in the literature [7]; some of the popular techniques are discussed here. The first is to ignore the tuple. This method is applied if a tuple contains several attributes with missing values, and it does not provide excellent results. The second method is to fill in the missing value manually. This approach is time-consuming and may not be feasible for a large data set with many missing values. Other methods mentioned in the literature fill in the missing value with the attribute mean or with the most probable value. Some researchers have used regression techniques, inference-based tools using a Bayesian formalism, or decision tree induction to replace the missing value. In these imputation-based procedures, missing values are imputed with reasonably probable values instead of deleting the tuple completely. The objective of these methods is to use known associations from the valid range of values in the data set [8].
In the multiple imputation method, multiple simulated values are selected for each incomplete item. Iterative data validation is then carried out with each simulated value substituted; five imputed copies are generally sufficient for a modest amount of missing data. Besides these methods, other methods also exist, such as replacing missing values with the series mean, with the mean or median of nearby points, with a linear interpolation between the adjacent valid values before and after the missing one, or with the linear regression trend value for that point [9].
This paper consists of five sections. The first section gives a brief introduction to missing data and related issues. The second section presents a short literature review. The third section includes the problem definition and the proposed algorithm. The fourth section covers the experimentation and results, and the last section is the conclusion.

RELATED WORKS
In this section, we focus on some recent developments and relevant information about imputing missing data. Kang and Hyun [10] showed that missing data can reduce the statistical power of training and can produce biased estimates, leading to unwarranted conclusions. The work reviews various methods for handling missing data, validates the tools that handle missing data, and compares the approaches for treating missing data. Mosavi et al. [11] proposed an approach to fuzzy classification for missing data. Rough-fuzzy sets are included in logical-type neuro-fuzzy systems (NFS); subsequently, a rough neuro-fuzzy classifier is derived. The neural network develops an additional purity, and the fuzzy scheme proceeds on the ability of knowledge. When rough-fuzzy sets are included in the NFS, the output is a rough neuro-fuzzy classifier. Robert K. Nowicki [12] presented a process that performs certain missing data imputation as a statistical method. If the input attributes for learning are numerical, the imputation uses Simpson's fuzzy min-max neural networks.
Shahla and Gerhard [13] proposed a weighted nearest neighbour method for imputing missing attribute values in categorical variables. This method explicitly utilizes evidence of the relationships between attributes, and its imputation error rate is low. M. Albayrak et al. [14] presented a realistic comparison of multiple imputations, in which neural network models that learn recurrent associations in the data are used for estimating missing values. Jin and Dong [15] proposed a new data cleaning method. Three comparative methods are performed to validate the model, and a neural network (NN) is used for pre-processing the training data. The algorithm trains a neural network, which is then used to create novel training data. The trained system produces several additional training instances that are added to the new training dataset. The proposed method improves the classification performance of the classifier.
Olanrewaju Akande et al. [16] presented multiple imputation as a common method for dealing with missing values in numerical records. Missing values are filled with values drawn from predictive models estimated using the observed data, which results in multiple completed versions of the record. The authors also suggested advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Tarle et al. [17] suggested a fuzzy neural network that performs classification on cleaned data with a correctly reduced feature set. The method integrates data cleaning as a pre-processing step to improve data quality, along with a bag of words for feature subset selection. Ezzine and Benhlima [18] presented a comparative analysis of listwise deletion, mean substitution, simple regression, and regression.
Susanti and Aziza [19] suggested handling missing values using a DBN. The DBN is a beneficial method for preserving the interactions between the variables of the data. The results of the estimation were used to fill the missing values in the data. A Support Vector Regression system is used for calculating the missing values; it is chosen for its performance compared to other similar systems. N. Anindita et al. [20] reported that the hepatitis dataset has an arbitrary pattern of missing values. This pattern can be handled by using MCMC and FCS as multiple imputation systems. The research focused on comparing combinations of multiple imputation systems with PCA as the feature selection method.
S. Azim and S. Aggarwal [21] suggested a two-stage hybrid model to fill in missing values. The proposed algorithm is tested on a simple and a complex dataset with varying percentages of missing values and varying values of the fuzzifier. Missing data that are not missing completely at random contain non-random elements that may bias the results. The deletion of missing data can introduce substantial bias into the results, and the reduced sample size may also affect the analysis. The mean-fill approach for estimating the values is common in missing data imputation.

PROPOSED ALGORITHM
In this work, the problem of insufficient training samples is handled. The training dataset may have insufficient data samples from its inception.
In another case, the training data may contain some data records with multiple missing attribute values, and such records need to be removed. In these situations, sufficient extra training tuples can be added to the original training data set to improve the classification ability of the classifier.
In the proposed algorithm, missing data is explicitly introduced into the original training tuples in a random manner; both the tuples and the features are chosen randomly. Subsequently, the missing attribute values are substituted with the available domain values of the same attribute, and the additional training tuples generated in this way are validated on the classifier. These tuples are in addition to the original training tuples. The generated training samples, along with the original training samples, increase the size of the data set and provide sufficient data samples for learning. The neuro-fuzzy classifier is trained on this dataset. The classification performance of the neuro-fuzzy classifier on test data is obtained using the k-fold cross-validation method.
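The generation step described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function names (`inject_missing`, `attribute_domains`, `expand_with_domain_values`), the use of `None` as the missing marker, and the choice to take each attribute's domain from the values observed in the original tuples are all assumptions made for the example. Each tuple is assumed to hold its class label in the last position.

```python
import random
from itertools import product

MISSING = None  # assumed marker for a missing attribute value


def inject_missing(tuples, n_cells, rng):
    """Return a copy of `tuples` with `n_cells` randomly chosen attribute
    cells (never the class label) replaced by MISSING."""
    data = [list(t) for t in tuples]
    n_attrs = len(data[0]) - 1  # last column is the class label
    cells = [(r, c) for r in range(len(data)) for c in range(n_attrs)]
    for r, c in rng.sample(cells, n_cells):
        data[r][c] = MISSING
    return [tuple(row) for row in data]


def attribute_domains(tuples):
    """Collect, per attribute, the sorted set of values observed in the data."""
    n_attrs = len(tuples[0]) - 1
    return [sorted({t[c] for t in tuples if t[c] is not MISSING})
            for c in range(n_attrs)]


def expand_with_domain_values(row, domains):
    """Generate one candidate tuple for every combination of domain values
    that can fill the missing cells of `row`."""
    missing = [c for c, v in enumerate(row[:-1]) if v is MISSING]
    if not missing:
        return []
    candidates = []
    for values in product(*(domains[c] for c in missing)):
        filled = list(row)
        for c, v in zip(missing, values):
            filled[c] = v
        candidates.append(tuple(filled))
    return candidates


original = [("sunny", "hot", "no"), ("rainy", "mild", "yes"), ("sunny", "mild", "yes")]
domains = attribute_domains(original)
with_gaps = inject_missing(original, 2, random.Random(0))
candidates = [c for row in with_gaps for c in expand_with_domain_values(row, domains)]
```

The candidates produced here are only proposals; in the proposed algorithm they still have to be validated on the classifier before they are added to the training data.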

NEURO-FUZZY CLASSIFIER WITH BOW
The data is classified using the neuro-fuzzy classifier. The extracted features are given as input to the neuro-fuzzy classifier, which classifies all the given data. The neuro-fuzzy system has a three-layered architecture; Fig. 1 presents the basic structure of the neuro-fuzzy classifier system. A neuro-fuzzy classifier is a fuzzy system that is trained by a learning algorithm derived from neural networks [22]. The learning algorithm operates only on local information and makes only local modifications to the fuzzy system. In general, a neuro-fuzzy system generates more compelling solutions than its components do when used individually [23]. The steps used in the neuro-fuzzy classifier are explained in the following sections.

FUZZIFICATION
The extracted features, or attributes, are accepted by the system as input, and these input attribute values are then fuzzified using membership functions (MF). The MF assigns each feature a degree of membership in each of the classes. It is used to extract features from hidden and inter-related data, and accordingly the neuro-fuzzy structure yields additional accuracy in the classification stage.
Here, the π-type membership function is used to classify the data. The π-type MF has a fuzzifier factor that can be adjusted according to the requirements of the problem. Choosing a suitable value of the fuzzifier controls the generalization capability and contributes more to the classification of the data. The steepness of the Gaussian function is controlled by changing the fuzzifier value. The memberships obtained after the fuzzification process are expressed in a membership matrix.
All rows and columns of the membership matrix are concatenated to convert it into a vector. This vector is given as the input to the neural network.
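The fuzzification step can be illustrated with a short sketch. The exact form of the π-type MF is not given in this paper, so the symmetric form below, with a centre, a radius, and an exponent `m` acting as the fuzzifier, is an assumption; with `m = 2` it reduces to the standard π function. The per-feature, per-class centres and radii (`centers`, `radii`) are hypothetical inputs that would normally be estimated from the training data.

```python
def pi_membership(x, center, radius, m=2.0):
    """Symmetric pi-type membership: 1 at the centre, 0.5 at half the radius,
    0 at and beyond the radius. `m` (the fuzzifier) controls the steepness."""
    u = abs(x - center) / radius
    if u <= 0.5:
        return 1.0 - 2.0 ** (m - 1.0) * u ** m
    if u <= 1.0:
        return 2.0 ** (m - 1.0) * (1.0 - u) ** m
    return 0.0


def fuzzify(sample, centers, radii, m=2.0):
    """Build the membership matrix (rows: features, columns: classes) and
    flatten it row by row into the input vector for the neural network."""
    matrix = [
        [pi_membership(x, centers[j][k], radii[j][k], m)
         for k in range(len(centers[j]))]
        for j, x in enumerate(sample)
    ]
    vector = [value for row in matrix for value in row]
    return matrix, vector


# Two features, three classes: the flattened vector has 2 * 3 = 6 entries.
centers = [[1.0, 3.0, 5.0], [0.0, 2.0, 4.0]]
radii = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
matrix, vector = fuzzify([2.0, 0.0], centers, radii)
```

A larger `m` makes the function flatter near the centre and steeper near the half-radius point, which is the sense in which the fuzzifier controls steepness in this sketch.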

NEURAL NETWORK
At this stage, we use a feed-forward multilayer perceptron classifier with three layers: an input layer, a hidden layer, and an output layer. The number of input nodes of the neural network is equal to the product of the number of attributes and the number of classes. The number of output nodes is the same as the number of classes [24]. The number of hidden nodes is equal to the square root of the product of the number of input and output nodes [25].
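As a worked example of these sizing rules, the sketch below computes the three layer sizes. The function name `layer_sizes` is hypothetical, and rounding the square root to the nearest integer is an assumption, since the text does not say how a non-integer result is handled.

```python
import math


def layer_sizes(n_attributes, n_classes):
    """Return (input, hidden, output) node counts for the three-layer network."""
    n_input = n_attributes * n_classes   # one input per (attribute, class) membership
    n_output = n_classes                 # one output per class
    # Assumed: round to the nearest integer and keep at least one hidden node.
    n_hidden = max(1, round(math.sqrt(n_input * n_output)))
    return n_input, n_hidden, n_output


# 4 attributes and 3 classes: 12 inputs, sqrt(12 * 3) = 6 hidden nodes, 3 outputs.
sizes = layer_sizes(4, 3)
```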

DEFUZZIFICATION
Defuzzification is the process of translating the degrees of membership of the output linguistic variables within their linguistic terms into crisp numerical values; it is carried out on the output nodes of the neural network. The details of the working of the neuro-fuzzy classifier algorithm are available in [17].
The data set DN is used for training the neuro-fuzzy classifier. This classifier is then used to check the correctness of the imputed data tuples: the imputed data tuples DPC are applied as test data to the classifier. A test tuple that is correctly classified is a correctly imputed instance; otherwise, it needs to be deleted. The correctly imputed tuples form DI. DN and DI are compared to find duplicate tuples constructed in DI, and the duplicate tuples are deleted. DI is merged into DN, and the resulting new set is DN1. The model is trained on the new data set DN1 and achieves enhanced classification performance.
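The validation and merging of imputed tuples into DN1 can be sketched as follows. The 1-nearest-neighbour trainer `train_1nn` is only a stand-in for the neuro-fuzzy classifier so that the example is self-contained; it, the function name `build_dn1`, and the removal of repeated tuples inside DI itself are assumptions made for illustration. Tuples hold numeric features followed by the class label.

```python
def train_1nn(tuples):
    """Stand-in trainer (NOT the neuro-fuzzy classifier): returns a predict
    function that labels a feature vector with its nearest training tuple."""
    def predict(features):
        nearest = min(
            tuples,
            key=lambda t: sum((a - b) ** 2 for a, b in zip(t[:-1], features)),
        )
        return nearest[-1]
    return predict


def build_dn1(dn, dpc, train):
    """Train on DN, keep the imputed tuples of DPC that are classified
    correctly (DI), drop duplicates, and merge DI into DN to obtain DN1."""
    predict = train(dn)
    di = [t for t in dpc if predict(t[:-1]) == t[-1]]
    seen = set(dn)
    di_unique = []
    for t in di:
        if t not in seen:  # skip tuples already in DN or already kept from DI
            seen.add(t)
            di_unique.append(t)
    return list(dn) + di_unique


dn = [(0.0, 0.0, "a"), (1.0, 1.0, "b")]
dpc = [(0.1, 0.0, "a"), (0.9, 1.0, "a"), (1.0, 1.0, "b"), (0.2, 0.1, "a"), (0.2, 0.1, "a")]
dn1 = build_dn1(dn, dpc, train_1nn)
```

In this toy run the second candidate is discarded because the classifier assigns it to class "b", the third is dropped as a duplicate of a tuple in DN, and the repeated last candidate is kept only once.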

EXPERIMENTATION AND PERFORMANCE EVALUATION
The UCI repository data sets Australian, Breast, Lymph, Shuttle and Weather [26] were used for conducting the experiments. The proposed system is implemented in MATLAB. Missing data was explicitly introduced into a few data tuples, and the imputation was done following the proposed algorithm. The datasets DN, DN1, and DM were used to train the said classifiers. The classification accuracy of these classifiers was measured using the k-fold cross-validation method.
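The k-fold cross-validation procedure used to measure accuracy can be sketched as below. This is a plain, unshuffled and unstratified variant written for illustration; the value of k, the fold assignment, and the majority-class trainer `train_majority` (a stand-in for the actual classifier) are assumptions and are not taken from the experiments.

```python
from collections import Counter


def train_majority(tuples):
    """Stand-in trainer: always predicts the most frequent training label."""
    label = Counter(t[-1] for t in tuples).most_common(1)[0][0]
    return lambda features: label


def k_fold_accuracy(tuples, k, train):
    """Split the data into k folds, train on k-1 folds, test on the held-out
    fold, and return the mean accuracy over the k rounds."""
    folds = [tuples[i::k] for i in range(k)]  # every k-th tuple goes to fold i
    accuracies = []
    for i, test_fold in enumerate(folds):
        train_set = [t for j, fold in enumerate(folds) if j != i for t in fold]
        predict = train(train_set)
        correct = sum(predict(t[:-1]) == t[-1] for t in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k


data = [(i, "a") for i in range(8)] + [(8, "b"), (9, "b")]
mean_accuracy = k_fold_accuracy(data, 5, train_majority)
```

The same function can be called with DN, DN1, or DM as the data set to obtain accuracies comparable to AN, AN1, and AM, provided the stand-in trainer is replaced by the classifier under study.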
The accuracy is also obtained with two existing imputation methods: the first substitutes missing attribute values with the mode, and the second substitutes them with the most probable value [2]. This is done to compare the performance of the proposed method with existing techniques. In Table 1, IA denotes improved accuracy.
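For reference, the mode-based baseline can be sketched in a few lines. The function name `impute_mode` and the use of `None` as the missing marker are assumptions for the example; tuples again carry the class label in the last position, and each attribute is assumed to have at least one observed value.

```python
from collections import Counter


def impute_mode(tuples):
    """Replace each missing attribute value (None) with the most frequent
    observed value of that attribute; class labels are left untouched."""
    n_attrs = len(tuples[0]) - 1
    modes = []
    for c in range(n_attrs):
        observed = [t[c] for t in tuples if t[c] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [
        tuple(modes[c] if c < n_attrs and v is None else v for c, v in enumerate(t))
        for t in tuples
    ]


rows = [("sunny", "hot", "no"), ("sunny", None, "yes"),
        (None, "mild", "yes"), ("rainy", "mild", "no")]
filled = impute_mode(rows)
```

Unlike the proposed method, this baseline produces exactly one completed tuple per incomplete tuple and does not enlarge the training set.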

CONCLUSION
The proposed algorithm creates additional data tuples using a domain-based multiple imputation method and adds these tuples to the original training data, which enhances the classification ability of the classifiers. The method utilizes the set of domain values of each feature for imputation, and the correctness of each imputed tuple is verified on the classifier. The technique is more suitable for small to medium data sets. Data imputation thus helps in developing enhanced and more accurate classifiers. The proposed method attains around 3.61% improvement in classification accuracy on a fuzzy neural network.

Figure 3 - Graph for classification accuracy comparison of the original and proposed method

Table 1. Comparison of the proposed method with existing techniques
Table 1 presents a comparison of the proposed method with existing techniques. Let AN be the classification accuracy of the classifier using the original data set (DN) as training data. Similarly, AN1 denotes the accuracy on the imputed data set DN1, and AM the accuracy on the dataset containing missing attribute values, DM. The classification performances obtained with existing methods are denoted by AMode and AMP, that is, the accuracies obtained with the mode and most probable value methods of imputation. It can be observed that the proposed method is moderately better than the existing techniques.

Table 2 presents the results obtained using the above-mentioned method for the proposed algorithm. The table also presents IN1 and IM, which denote the enhancement in classification accuracy. IN1 is the improvement in accuracy of AN1 in comparison to AN, that is, the improvement in accuracy when moving from the original data set (DN) as training data to the imputed data set DN1. IM is calculated similarly. The improvement in accuracy is calculated by equations 1 and 2.