Combining Imputation Method and Feature Weighting Algorithms to Improve the Classification Accuracy of Incomplete Data

It is a common problem that datasets in pattern classification tasks contain missing values. To deal with missing values, the K-Nearest Neighbor Imputation method with Euclidean distance is often used to find the K nearest neighbors of an incomplete sample, and the mean of these neighbors is taken as the filling value. Although this method handles missing values well, it cannot effectively deal with classification tasks involving noisy and redundant features because it ignores the importance of individual features. To address this, the paper proposes a novel imputation method. In our method, missing values are first filled using the K-Nearest Neighbor Imputation method based on Euclidean distance; a feature weighting vector is then obtained by training a feature weighting algorithm on the initially filled data; finally, K-Nearest Neighbor Imputation with a weighted Euclidean distance is used to fill the missing values. Experimental results on six datasets show that the proposed method can improve the classification accuracy of incomplete data.


Introduction
Classification is a common task in machine learning that arises in many fields, such as computer vision, image processing [1], natural language processing and bioinformatics. The current mainstream classification algorithms include support vector machines [2], neural networks and K-Nearest Neighbors. However, these algorithms are mainly designed for complete datasets and are not suitable for incomplete datasets with missing values. Incomplete datasets are nevertheless common in practical applications. For example, more than 40% of the datasets in the UCI machine learning repository [3] contain missing attribute values to varying degrees. In social questionnaire surveys, questions that respondents are unwilling to answer lead to missing data [4]. Therefore, it is very meaningful to improve the classification accuracy of incomplete data.
For classification problems with missing values, there are three mainstream solutions at present: (1) Sample deletion [5], that is, delete all samples with missing values and provide the remaining complete data to the classifier; this method is only applicable when few samples contain missing values. (2) Decision tree algorithms that can classify incomplete data directly, such as C4.5; however, the classification performance of such decision tree classifiers is relatively low. (3) Imputation, which first fills in the missing attribute values according to some rule and then applies a mainstream classifier to the filled, complete dataset. Because imputation usually achieves high classification accuracy, it has become the mainstream approach to incomplete data classification [6]. At present, there are three common imputation methods: Zero Imputation (ZI), Mean Imputation (MI) [7] and K-Nearest Neighbor Imputation (KNNI) [8]. ZI is very simple: it sets all missing values to 0. For a given feature, MI uses the mean of all its observed values as the predicted value of each missing value, thus preserving the feature mean. Since all missing values of a given feature receive the same value, both MI and ZI ignore the variability of the data. Unlike the first two methods, KNNI fills missing values on a per-sample basis. For a sample with missing features, KNNI first finds its K nearest neighbors based on Euclidean distance, and then uses the mean of the observed values in these neighbors to fill in the missing values. KNNI deals with incomplete data classification better because it considers the similarity between samples when filling missing values.
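The ZI and MI strategies described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; missing entries are assumed to be encoded as NaN:

```python
import numpy as np

def zero_impute(X):
    """ZI: replace every missing value (NaN) with 0."""
    Xf = X.copy()
    Xf[np.isnan(Xf)] = 0.0
    return Xf

def mean_impute(X):
    """MI: replace each missing value with the mean of the observed
    values of that feature, preserving the feature mean."""
    Xf = X.copy()
    col_means = np.nanmean(Xf, axis=0)          # per-feature observed mean
    rows, cols = np.where(np.isnan(Xf))
    Xf[rows, cols] = col_means[cols]
    return Xf
```

Note that under MI every missing entry of a feature receives the same value, which is exactly the loss of variability the text points out.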
Although KNNI handles missing value filling well, it ignores the importance of features when filling missing attribute values. For this reason, this paper proposes an algorithm that combines the imputation method with feature selection [9] to solve the classification problem of incomplete data. The algorithm first uses KNNI to fill in the missing values of the incomplete dataset, then trains a feature weighting algorithm on the filled data to obtain the importance of each feature, and finally refills the missing values of the original incomplete dataset based on the feature importance.
The rest of this article is arranged as follows: Section 2 introduces the algorithm proposed in this article in detail; Section 3 introduces the comparative experiment and result analysis of the proposed algorithm and the traditional method. Finally, Section 4 summarizes the content of this article.

Method
In this section, we propose an algorithm for missing value filling based on feature importance. We assume that the classification task consists of a training set and a testing set. The proposed algorithm first performs an initial filling of the training set using K-Nearest Neighbor Imputation with Euclidean distance, and then applies a feature weighting algorithm to the filled training set to obtain a feature weighting vector. Treating this vector as per-feature weights yields a weighted Euclidean distance. Finally, the proposed algorithm uses K-Nearest Neighbor Imputation with the weighted Euclidean distance to fill in the missing values of the original, unfilled training set. Missing values in the testing set are processed in the same way.
To describe the proposed algorithm concisely, the paper defines some notation. Let $D = (X, y)$ represent an incomplete dataset, where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $y$ is the corresponding label vector, $n$ is the total number of samples, and $d$ is the feature dimension. The filled dataset is represented by $\hat{D} = (\hat{X}, y)$. In addition, this paper uses a feature weighting vector $w = [w_1, w_2, \ldots, w_d]$ to express the importance of features. In the following, we introduce the K-Nearest Neighbor Imputation based on Euclidean distance, the feature weighting algorithms used to obtain the feature importance vector, and the K-Nearest Neighbor Imputation based on weighted Euclidean distance.

K-Nearest Neighbor Imputation algorithm based on Euclidean distance
In this section, the K-Nearest Neighbor Imputation method based on Euclidean distance is used to fill the missing values in the incomplete dataset $D$ and transform it into a complete dataset $\hat{D} = (\hat{X}, y)$. To fill a missing value $x_{it}$ in a given sample $x_i$, the K-Nearest Neighbor Imputation method uses the following steps:

1) Use all samples without missing values and all samples with missing values in $D$ to construct the sample sets $C$ and $M$, respectively.

2) To fill any missing entry $x_{it}$ of a sample $x_i$ in $M$, use the observed values of $x_i$ to compute the distance to every sample in $C$, and find the $K$ nearest neighbors of $x_i$ accordingly. The distance between $x_i$ and a sample $x_j$ in $C$ is calculated as
$$d(x_i, x_j) = \sqrt{\sum_{t \in O_i} (x_{it} - x_{jt})^2},$$
where $O_i$ denotes the index set of the observed features of $x_i$.

3) Finally, the missing value is filled with the mean of the $t$-th feature over the $K$ nearest neighbor samples:
$$\hat{x}_{it} = \frac{1}{K} \sum_{x_j \in \mathrm{Near}(x_i)} x_{jt},$$
where $\mathrm{Near}(x_i)$ represents the set of $K$ nearest neighbor samples of $x_i$.
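The three steps above can be sketched as follows. This is an illustrative implementation, not the authors' code; it computes distances only over the features observed in each incomplete sample:

```python
import numpy as np

def knn_impute(X, K=3):
    """KNNI sketch: fill NaNs using the K nearest complete samples.

    Distances are Euclidean, computed only over the features that are
    observed in the incomplete sample (steps 1-3 of the text).
    """
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]          # set C
    Xf = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:  # samples in set M
        obs = ~np.isnan(X[i])                       # observed-feature mask O_i
        d = np.sqrt(((complete[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:K]]       # K nearest neighbors
        for t in np.where(~obs)[0]:
            Xf[i, t] = nearest[:, t].mean()         # mean of t-th feature
    return Xf
```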

Feature weighting algorithm
To obtain the feature importance vector, a feature weighting algorithm is trained on the initially filled training dataset. In this study, we choose three feature weighting algorithms based on the nearest neighbor model: ReliefF [10], NCFS [11] and WKNN [12].

ReliefF
ReliefF is a filter-type feature selection algorithm. Its specific idea is to randomly select a sample $\hat{x}_i$ from the training set, find its $k_1$ nearest neighbors of the same class (the hits) and $k_1$ nearest neighbors from each different class (the misses), and use the attribute value differences between $\hat{x}_i$ and these neighbors to assign a score to each feature.
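For reference, the weight update ReliefF applies to each feature $t$ after sampling $\hat{x}_i$ can be written as below. This is the standard formulation from the ReliefF literature, not reproduced in the source; $m$ denotes the number of sampled instances, $H_i$ the $k_1$ hits, $M_i(c)$ the $k_1$ misses from class $c$, and $\operatorname{diff}$ the normalized attribute difference:

```latex
w_t \leftarrow w_t
- \sum_{h \in H_i} \frac{\operatorname{diff}(t, \hat{x}_i, h)}{m \, k_1}
+ \sum_{c \neq \operatorname{class}(\hat{x}_i)}
    \frac{P(c)}{1 - P(\operatorname{class}(\hat{x}_i))}
    \sum_{u \in M_i(c)} \frac{\operatorname{diff}(t, \hat{x}_i, u)}{m \, k_1}
```

Features that separate $\hat{x}_i$ from the misses thus gain weight, while features that separate it from the hits lose weight.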

NCFS
NCFS solves for the feature weighting vector by maximizing the classification accuracy of the leave-one-out method. In NCFS, the probability that sample $\hat{x}_i$ selects $\hat{x}_j$ as its reference point is
$$p_{ij} = \frac{\kappa\big(d_w(\hat{x}_i, \hat{x}_j)\big)}{\sum_{k \neq i} \kappa\big(d_w(\hat{x}_i, \hat{x}_k)\big)}, \qquad p_{ii} = 0,$$
where $\kappa(z) = \exp(-z/\sigma)$ is a kernel function and $d_w$ is the weighted distance between samples. The probability of sample $\hat{x}_i$ being correctly predicted can be expressed as
$$p_i = \sum_j y_{ij} \, p_{ij},$$
where $y_{ij} = 1$ if $\hat{x}_i$ and $\hat{x}_j$ share the same label and $y_{ij} = 0$ otherwise. Summing $p_i$ over all samples gives an approximation of the leave-one-out classification accuracy. In addition, to obtain a sparse weighting vector, NCFS introduces an $L_1$ penalty in the objective function, so the final optimization problem can be formalized as
$$\max_{w} \; \xi(w) = \sum_{i} p_i - \lambda \sum_{q=1}^{d} w_q^2,$$
where the balance parameter $\lambda$ is a positive constant. In particular, by replacing $w_q$ with $w_q^2$, NCFS removes the inequality (nonnegativity) constraints during optimization.

WKNN
WKNN is an embedded feature selection algorithm originally designed for regression tasks. When used for classification, its principle is similar to that of NCFS; the most obvious difference is that it uses Euclidean distance to calculate the distance between samples.

Weighted K-Nearest Neighbor Imputation algorithm
To fill in missing values based on the feature importance vector, the weighted K-Nearest Neighbor Imputation algorithm needs to use both the unfilled training dataset $D$ and the filled training set $\hat{D}$. For a missing value $x_{it}$ in a given sample $x_i$, the weighted K-Nearest Neighbor Imputation algorithm uses the following steps:

1) Let $\hat{x}_i$ be the sample in $\hat{D}$ corresponding to $x_i$, and let $\hat{C}$ and $\hat{M}$ be the sets of samples in $\hat{D}$ corresponding to the sets $C$ and $M$, respectively.

2) Calculate the weighted Euclidean distance between $\hat{x}_i$ and all samples in $\hat{C}$. The weighted distance between samples is
$$d_w(\hat{x}_i, \hat{x}_j) = \sqrt{\sum_{t=1}^{d} w_t \, (\hat{x}_{it} - \hat{x}_{jt})^2}.$$
Obtain the index set $I$ of the $K$ nearest neighbor samples of $\hat{x}_i$ through a nearest neighbor search.

3) Based on the index set $I$ and the original training data, the final filling value is
$$\hat{x}_{it} = \frac{1}{K} \sum_{j \in I} x_{jt}.$$
Since the samples indexed by $I$ belong to the complete set $C$, their original values $x_{jt}$ are always observed.
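The weighted imputation steps above can be sketched as follows. This is an illustrative implementation, not the authors' code; it takes the original data, the initially filled data and the feature weighting vector as inputs:

```python
import numpy as np

def weighted_knn_impute(X, X_init, w, K=3):
    """Weighted KNNI sketch (steps 1-3 of the text).

    X      : original data with NaNs (sets C and M are derived from it)
    X_init : the same data after the initial KNNI fill (D-hat)
    w      : feature weighting vector from the feature weighting algorithm
    """
    X, X_init, w = (np.asarray(a, dtype=float) for a in (X, X_init, w))
    miss_rows = np.isnan(X).any(axis=1)
    C_hat = X_init[~miss_rows]                     # filled counterparts of C
    C_orig = X[~miss_rows]                         # original complete samples
    Xf = X.copy()
    for i in np.where(miss_rows)[0]:
        diff = C_hat - X_init[i]
        d = np.sqrt((w * diff ** 2).sum(axis=1))   # weighted Euclidean distance
        I = np.argsort(d)[:K]                      # index set of K neighbors
        for t in np.where(np.isnan(X[i]))[0]:
            Xf[i, t] = C_orig[I, t].mean()         # fill from original values
    return Xf
```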

Experimental result
In order to verify the effectiveness of the proposed algorithm, we compared the proposed algorithm with other algorithms on six datasets. Next, we firstly introduce the dataset and experimental settings, and then give the experimental results and related analysis.

Datasets
The six datasets used in the experiment are glass, hayes-roth, iris, lymphography, monk-2 and tae. Table 1 shows their specific information. To calculate the classification accuracy on a given dataset, we randomly select 70% of the samples as the training set and 30% as the testing set. This process is repeated 10 times, and the reported result is the average classification accuracy over the 10 runs. To avoid the influence of differing feature value ranges, we normalize the features.
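The split-and-normalize protocol can be sketched as below. The paper does not specify the normalization scheme, so min-max scaling is assumed here, with its parameters estimated on the training set only:

```python
import numpy as np

def split_and_normalize(X, y, train_frac=0.7, rng=None):
    """Random 70/30 split followed by min-max normalization,
    with scaling parameters estimated on the training set."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    n_tr = int(train_frac * len(X))
    tr, te = idx[:n_tr], idx[n_tr:]
    lo = np.nanmin(X[tr], axis=0)
    hi = np.nanmax(X[tr], axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    norm = lambda A: (A - lo) / scale
    return norm(X[tr]), y[tr], norm(X[te]), y[te]
```

Repeating this split 10 times with different permutations and averaging the resulting accuracies reproduces the evaluation protocol described above.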

Experimental settings
The parameter K of both KNNI and the KNN classifier used in this paper is 3. The relevant parameter settings of the feature selection method NCFS are the same as those in [11]. The code for WKNN and ReliefF can be downloaded from GitHub. The parameter $k_1$ of ReliefF in Section 2.2.1 is set to 5. After the feature weighting vector is obtained by a feature selection method, we normalize it. In addition, we use the MCAR [13] mechanism to generate, for each dataset, corresponding datasets with missing rates of 5%, 10% and 15%. Furthermore, to verify the effectiveness of the proposed algorithm, we add noise to the six datasets: 10 noise features are appended to each dataset, drawn from a normal distribution with mean 0 and standard deviation 1. On the noisy datasets, we only use the MCAR mechanism to generate a missing rate of 5%.
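The two data-corruption steps, MCAR missingness and appended noise features, can be sketched as below. This is an illustration of the described protocol, not the authors' experimental code:

```python
import numpy as np

def add_mcar_missing(X, rate, rng=None):
    """MCAR sketch: each entry becomes NaN independently with prob `rate`,
    irrespective of any observed or unobserved value."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def add_noise_features(X, n_noise=10, rng=None):
    """Append `n_noise` noise columns drawn from N(0, 1)."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(0.0, 1.0, size=(X.shape[0], n_noise))
    return np.hstack([X, noise])
```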

Experimental results and analysis
This section shows the comparison results of KNNI, KNNI+ReliefF, KNNI+NCFS and KNNI+WKNN on the six datasets. Figure 1 shows the average classification accuracy of each algorithm on the datasets without added noise; the abscissa represents the missing rate of the data, and the ordinate represents the average classification accuracy. It can be seen from Figure 1 that the classification accuracy of KNNI+NCFS and KNNI+WKNN is higher than that of KNNI overall, while the comparison between KNNI+ReliefF and KNNI is unstable. Therefore, considering the importance of features in KNNI can indeed improve the classification accuracy of incomplete data, but a feature weighting algorithm with good performance must be chosen. Figure 1 also shows that the classification accuracy of KNNI+WKNN is higher than that of KNNI+NCFS; in our experiments, WKNN is therefore the best choice. To further verify that the proposed algorithm can improve the classification accuracy of incomplete data, we also compare the classification performance of KNNI, KNNI+ReliefF, KNNI+NCFS and KNNI+WKNN on the datasets with noise. The specific results are shown in Table 2, which reports the average classification accuracy and standard deviation of each algorithm; bold characters indicate the best value in each row. It can be seen from Table 2 that among the four algorithms, KNNI+WKNN has the highest classification accuracy, followed by KNNI+NCFS, while the classification accuracy of KNNI+ReliefF is actually not as good as that of KNNI. In summary, we finally choose KNNI+WKNN as the algorithm proposed in this article.

Conclusion
This paper combines feature importance with the K-Nearest Neighbor Imputation method and proposes a filling method that can effectively improve the classification accuracy of incomplete data. The proposed method first uses the K-Nearest Neighbor Imputation method to fill the original incomplete dataset, transforming it into a complete dataset. Second, a feature weighting algorithm is used to obtain a feature weighting vector. Finally, the K-Nearest Neighbor Imputation method with the weighted Euclidean distance is applied to fill the original incomplete dataset again. This paper compares KNNI, KNNI+ReliefF, KNNI+NCFS and KNNI+WKNN on six datasets. The experimental results show that KNNI+NCFS and KNNI+WKNN outperform KNNI. Therefore, the proposed algorithm can improve the classification accuracy of incomplete data, provided that a feature weighting algorithm with good performance is selected. In the future, we will consider how to fill regression datasets with missing values.