Classification Based on Attribute Positive Correlation and Average Similarity of Nearest Neighbors

The K-Nearest Neighbor algorithm (KNN) is a method for classifying objects based on the k closest training objects. An object is classified by a majority vote of its nearest neighbors. “Closeness” is defined in terms of a similarity measure between two objects. KNN is not only simple, but also sometimes achieves high accuracy. However, the quality of a KNN classification result depends on the similarity measure between two objects and on the selection of k. Moreover, the average similarity of the majority nearest neighbors may be less than that of the minority nearest neighbors. To deal with these problems, we propose a new classification approach called APCAS: classification based on the attribute values which are positively correlated with one of the class labels and on the average similarity of the nearest neighbors in each class. First, we define a new similarity measure based on the attribute values which are positively correlated with one of the class labels. Second, we classify a new object using the average similarity of the nearest neighbors in each class, without selecting k. Experimental results on the mushroom data show that APCAS achieves high accuracy.


INTRODUCTION
KNN is widely discussed and applied in pattern recognition (Zhang et al., 2004). KNN is a method for classifying objects based on the k closest training examples. "Closeness" is defined in terms of the similarity measure between two objects. KNN is among the simplest of all machine learning algorithms. However, the classification accuracy of KNN depends, first, on the similarity measure between two objects. To achieve high accuracy, Feng et al. (2005) proposed the KNN-M algorithm for text categorization. The major difference between KNN-M and KNN lies in how text similarity is calculated when finding the k nearest neighbors. For categorical variables, the similarity between two objects is often computed using the simple matching approach. Nevertheless, simple attribute value matching cannot reflect the importance of the matched attribute values to the class label. If two objects are similar, then they not only share many attribute values but also share many attribute values which are important to one of the class labels. If an attribute value is important, then it must be positively correlated with one of the class labels.
To deal with the problem of the similarity measure, we define a new similarity measure based on the attribute values which are positively correlated with one of the class labels. If two objects are similar, then they share many attribute values which are positively correlated with one of the class labels. A difficulty in this study is that few correlation measures have proper bounds for effectively evaluating the degree of correlation between an attribute value and a class label. The most commonly employed method for correlation mining is two-dimensional contingency table analysis of categorical data, using the chi-squared statistic as a measure of significance. Brin et al. (1997) analyzed contingency tables to generate correlation rules that identify statistical correlation in both the presence and absence of items in patterns. Liu et al. (1999) analyzed contingency tables to discover unexpected and interesting patterns that have a low level of support and a high level of confidence. Bing et al. (1999) used contingency tables for pruning and for discovering correlations. A low chi-squared value (less than the cutoff value, e.g., 3.84 at the 95% significance level) effectively indicates that all patterns $AB$, $\bar{A}B$, $A\bar{B}$, $\bar{A}\bar{B}$ are independent, that is, $A$ and $B$, $\bar{A}$ and $B$, $A$ and $\bar{B}$, $\bar{A}$ and $\bar{B}$ are all independent. A high chi-squared value, however, only indicates that at least one of the patterns $AB$, $\bar{A}B$, $A\bar{B}$, $\bar{A}\bar{B}$ is not independent, so it is possible that $A$ and $B$ are independent in spite of the high chi-squared value. Therefore, the chi-squared value is not reasonable for measuring the degree of correlation of $A$ and $B$.
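The chi-squared test on a two-dimensional contingency table described above can be sketched as follows. This is a minimal illustration using the standard Pearson statistic for a 2x2 table; the function name and the two example tables of counts are hypothetical and are not taken from the cited studies.

```python
import math


def chi_squared_2x2(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Pearson chi-squared statistic for a 2x2 table of counts.

    Cells are the counts of the patterns AB, A not-B, not-A B, not-A not-B.
    """
    n = n_ab + n_a_notb + n_nota_b + n_nota_notb
    row_a = n_ab + n_a_notb
    row_nota = n_nota_b + n_nota_notb
    col_b = n_ab + n_nota_b
    col_notb = n_a_notb + n_nota_notb
    denom = row_a * row_nota * col_b * col_notb
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    return n * (n_ab * n_nota_notb - n_a_notb * n_nota_b) ** 2 / denom


CUTOFF_95 = 3.84  # cutoff quoted in the text (1 degree of freedom)

# Hypothetical counts: the first table is exactly proportional, the second is not.
chi2_low = chi_squared_2x2(20, 30, 20, 30)
chi2_high = chi_squared_2x2(40, 10, 10, 40)

print(chi2_low, chi2_low < CUTOFF_95)
print(chi2_high, chi2_high > CUTOFF_95)
```

The statistic only says whether the table as a whole departs from independence; it does not provide a bounded degree of correlation for a single pattern such as AB.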
For other commonly used measures, the measure P(AB)/P(A)P(B) does not have proper bounds. The measure P(AB)-P(A)P(B) (Piatetsky-Shapiro, 1991) is not rational, because it ignores how large P(AB) is relative to P(A)P(B). For example, if P(AB) = 0.02, P(A)P(B) = 0.01, P(CD) = 0.99 and P(C)P(D) = 0.98, then P(AB)-P(A)P(B) = P(CD)-P(C)P(D) = 0.01, so by this measure the correlation degree of A and B is equal to the correlation degree of C and D. But P(AB)/P(A)P(B) = 2 and P(CD)/P(C)P(D) ≈ 1.01, so the correlation degree of A and B is much higher than the correlation degree of C and D. In this study, we use the correlation measure correlation confidence (Zhong et al., 2006) to evaluate the correlation between two items. The measure correlation confidence has two bounds, -1 and 1. We can see from Zhong et al. (2006) that the measure correlation confidence is reasonable.
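The arithmetic of the example above can be checked with a short sketch. The helper names are hypothetical; the probabilities are the ones given in the text.

```python
import math


def difference_measure(p_joint, p_product):
    """P(AB) - P(A)P(B), the difference-based measure."""
    return p_joint - p_product


def ratio_measure(p_joint, p_product):
    """P(AB) / (P(A)P(B)), the ratio-based measure."""
    return p_joint / p_product


# Values from the example: (P(AB), P(A)P(B)) and (P(CD), P(C)P(D)).
diff_ab = difference_measure(0.02, 0.01)
diff_cd = difference_measure(0.99, 0.98)
ratio_ab = ratio_measure(0.02, 0.01)
ratio_cd = ratio_measure(0.99, 0.98)

# The differences agree (up to floating-point error) while the ratios do not.
print(math.isclose(diff_ab, diff_cd))
print(round(ratio_ab, 2), round(ratio_cd, 2))
```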
Secondly, the quality of a KNN classification result depends on the selection of k. The best choice of k depends upon the data, and it is difficult to select an appropriate k (Anil, 2006). Moreover, the average similarity of the majority nearest neighbors may be less than that of the minority nearest neighbors. Thus, it may be unreasonable to classify a new object by a majority vote. Therefore, we propose a new classification approach called APCAS. We not only use a new similarity measure but also classify a new object using the average similarity of the nearest neighbors in each class, without selecting k. Experimental results on the mushroom data set show that APCAS achieves high accuracy.
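The situation in which a majority vote and the average similarity disagree can be illustrated with a small sketch. The neighbor list, the class labels c1 and c2, and the similarity values are hypothetical; the code only contrasts the two decision rules and is not the APCAS algorithm itself.

```python
from collections import Counter, defaultdict


def majority_vote(neighbors):
    """Return the class label that occurs most often among the neighbors."""
    counts = Counter(label for label, _ in neighbors)
    return counts.most_common(1)[0][0]


def highest_average_similarity(neighbors):
    """Return the class whose neighbors have the highest mean similarity."""
    by_class = defaultdict(list)
    for label, similarity in neighbors:
        by_class[label].append(similarity)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))


# Hypothetical (class label, similarity to the new object) pairs.
neighbors = [
    ("c1", 0.30), ("c1", 0.35), ("c1", 0.40),  # majority, but weakly similar
    ("c2", 0.90), ("c2", 0.85),                # minority, but strongly similar
]

print(majority_vote(neighbors))
print(highest_average_similarity(neighbors))
```

Here three of the five neighbors belong to c1, yet the two c2 neighbors are far more similar to the new object, so the two rules return different labels.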

METHOD AND EXAMPLE
In this section, we first introduce some related definitions and then give an example to explain the classification algorithm APCAS.
In statistical theory, $X_1, X_2, \ldots, X_n$ are independent if and only if, $\forall k$ and $\forall\, 1 \le i_1 < i_2 < \cdots < i_k \le n$:

$$P(X_{i_1} X_{i_2} \cdots X_{i_k}) = P(X_{i_1}) P(X_{i_2}) \cdots P(X_{i_k})$$

We use the correlation measure correlation confidence to evaluate the degree of correlation between any two objects. The correlation confidence of any two objects C and D is defined as follows (Zhong et al., 2006):

From Definition 2, we can see that if X and Y have high similarity, then they must share attribute values which are positively correlated with one of the class labels and which at the same time have high probability.
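The behavior of a bounded correlation measure can be illustrated numerically. The sketch below assumes that correlation confidence takes the form (P(CD) - P(C)P(D)) / (P(CD) + P(C)P(D)). This form is an assumption: it is consistent with the bounds -1 and 1 stated in this paper, but the formula itself does not appear in the text and should be checked against Zhong et al. (2006). The function name is hypothetical.

```python
import math


def correlation_confidence(p_joint, p_product):
    """Assumed form: (P(CD) - P(C)P(D)) / (P(CD) + P(C)P(D)).

    p_joint is P(CD); p_product is P(C)P(D). The formula is an assumption,
    not a quotation from Zhong et al. (2006).
    """
    denom = p_joint + p_product
    if denom == 0:
        raise ValueError("P(CD) + P(C)P(D) must be positive")
    return (p_joint - p_product) / denom


# The two pairs used earlier in the paper's comparison of measures.
cc_ab = correlation_confidence(0.02, 0.01)
cc_cd = correlation_confidence(0.99, 0.98)

# Boundary and independence cases under the assumed form.
cc_never_together = correlation_confidence(0.0, 0.2)
cc_independent = correlation_confidence(0.25, 0.25)

print(round(cc_ab, 3), round(cc_cd, 3))
print(cc_never_together, cc_independent)
```

Under this assumed form, the value is positive exactly when P(CD) > P(C)P(D), zero under independence, and -1 when C and D never occur together.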
We illustrate the classification algorithm APCAS using the following example.
Example 1: Consider the data set T shown in Table 1. X7 is a test object.
From the example, we can see that it is reasonable for us to use the average similarity of nearest neighbors in each class.

EXPERIMENTAL RESULTS
All experiments are performed on the mushroom characteristic data set, which consists of 5643 objects. Each object has 23 attribute values.
We compare four algorithms. The first algorithm, A1, and the third algorithm, A3, classify a new object using the average similarity of all training objects in each class. The second algorithm, A2, and the fourth algorithm, APCAS, classify a new object using the average similarity of its nearest neighbors in each class. Algorithms A1 and A2 use the similarity measure defined in Definition 1. Algorithms A3 and APCAS use the similarity measure defined in Definition 2.
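Classification by the average similarity of all training objects in each class, as in A1 and A3, can be sketched as follows. Simple matching is used here only as a stand-in similarity measure, which is an assumption; Definition 1 and Definition 2 are not reproduced in the code. The function names and the toy training set are hypothetical. The rule that A2 and APCAS use to pick the nearest neighbors within each class is not specified in this section, so it is left out.

```python
import math
from collections import defaultdict


def simple_matching(x, y):
    """Fraction of attribute positions on which x and y agree (stand-in)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(a == b for a, b in zip(x, y)) / len(x)


def classify_by_class_average(test_obj, training, similarity=simple_matching):
    """Assign the class whose training objects are most similar on average."""
    sims = defaultdict(list)
    for attributes, label in training:
        sims[label].append(similarity(test_obj, attributes))
    averages = {label: sum(v) / len(v) for label, v in sims.items()}
    return max(averages, key=averages.get), averages


# Hypothetical categorical training objects: (attribute values, class label).
training = [
    (("a", "x", "p"), "c1"),
    (("a", "y", "p"), "c1"),
    (("b", "y", "q"), "c2"),
    (("b", "x", "q"), "c2"),
]

label, averages = classify_by_class_average(("a", "y", "p"), training)
print(label, averages)
```

Because every training object contributes to its class average, no value of k has to be chosen.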
In Table 2, we select the training set, named T-set, at random. We vary the number of training objects from 100 to 500. We take each consecutive block of 500 objects, from object 1 to object 5000, in turn as the test set. We compare the average classification accuracy of algorithm APCAS with that of algorithms A1, A2 and A3. From Table 2, we can see that algorithm APCAS has higher classification accuracy than the other algorithms. We can also see that algorithm A2 has higher classification accuracy than algorithm A1. Therefore, we can conclude from Table 2 that:
• It is reasonable to use the average similarity of the nearest neighbors in each class.
• The similarity measure defined in Definition 2 is better than the one defined in Definition 1.
In Table 3, we select 100 training objects at random. We take each consecutive block of 500 objects, from object 1 to object 5000, in turn as the test set. We compare the classification accuracy of algorithm APCAS with that of algorithms A1, A2 and A3. From Table 3, we can see that algorithm APCAS has higher accuracy than the other algorithms in every run.
In Table 4, we select 500 training objects at random. We take each consecutive block of 500 objects, from object 1 to object 5000, in turn as the test set. We compare the classification accuracy of algorithm APCAS with that of algorithms A1, A2 and A3. From Table 4, we can see that algorithm APCAS has higher accuracy than the other algorithms in many of the runs.
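The test-set selection described above, every 500 objects in turn from 1 to 5000, can be sketched as a list of consecutive index ranges. This is one reading of the protocol; the function names are hypothetical and the accuracy values in the usage lines are placeholders, not results from the tables.

```python
def consecutive_blocks(first=1, last=5000, size=500):
    """Return inclusive (start, end) ranges covering first..last in blocks."""
    return [(start, start + size - 1) for start in range(first, last + 1, size)]


def average_accuracy(accuracies):
    """Mean of the per-block accuracies."""
    if not accuracies:
        raise ValueError("at least one accuracy value is required")
    return sum(accuracies) / len(accuracies)


blocks = consecutive_blocks()
print(len(blocks), blocks[0], blocks[-1])

# Placeholder per-block accuracies, for illustration only.
placeholder = [0.5] * len(blocks)
print(average_accuracy(placeholder))
```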

CONCLUSION
Although KNN is simple, it suffers from some deficiencies. In this study, we proposed a new classification algorithm, APCAS, based on a new similarity measure. When measuring the similarity between two objects, we consider not only the importance of an attribute value to the class label, but also the number of matched attribute values. If an attribute value is important, then it is not only positively correlated with one of the class labels, but also has high probability. If two objects are similar, then they must share attribute values which are important to the class label. In order to achieve high classification accuracy, we classify a new object by the average similarity of its nearest neighbors in each class. Experimental results show that APCAS achieves higher classification accuracy than the algorithms it was compared with.

ACKNOWLEDGMENT
This study is supported in part by the China NSF program (No. 61170129, 10971186) and a grant from the education ministry of Fujian, China (No. JA10202).
We can see that the correlation confidence of C and D has two bounds, -1 and 1. According to the concept of correlation in statistical theory, any two objects C and D are positively correlated if and only if their correlation confidence is greater than 0. The training data set T has m distinct attributes A1, A2, …, Am and a list of classes C1, C2, …, Cm. All attribute values are categorical. We define a new

Table 1 :
Definition 2: (Similarity 2) Let X and Y be two objects that share the attribute values v1, v2, …, vm. P(vi) is the probability of attribute value vi, and cj is a class label. If Y belongs to cj and the correlation confidence of vi and cj is greater than 0, then the similarity of X and Y is defined as follows:

Table 2: Comparison of average accuracy