Outlier Analysis of Categorical Data using NAVF

Introduction
Outlier analysis is an important research field with applications such as credit card fraud detection, network intrusion detection, and medicine. It concentrates on detecting infrequent data records in a dataset. Most existing systems concentrate on numerical or ordinal attributes. Categorical attribute values can sometimes be converted into numerical values, but this process is not always preferable. In this paper a simple method for categorical data is presented.

AVF is one of the efficient methods for detecting outliers in categorical data. It calculates the frequency of each value of each attribute and the corresponding probability, then finds the attribute value frequency of each record by averaging these probabilities, and selects the top k outliers as the records with the lowest AVF scores. The only parameter of this method is k, the number of outliers. FPOF is based on frequent patterns, which are adopted from the Apriori algorithm [1]. It calculates the frequent itemsets of each object, derives an FPOF score from their frequencies, and reports the k objects with the lowest FPOF scores as outliers. This method takes more time to detect outliers than AVF. Besides k, it requires σ, a threshold value that decides which subsets of each data object are frequent. Another method, Greedy [2], detects outliers in categorical data using an entropy score. The previous approaches used to detect outliers are described below.

Existing Approaches

Statistical Based
These methods adopt a parametric model that describes the distribution of the data, and the data are mostly univariate [3,4]. Their main drawbacks are the difficulty of finding a correct model for different datasets and the decrease in efficiency as the number of dimensions increases [4]. To rectify this problem, principal component analysis can be used. Another way to handle high-dimensional datasets is to organize the data records in layers; however, these ideas are not practical for three or more dimensions.

Distance-Based
Distance-based methods make no assumptions about the distribution of the data records; instead they compute the distances between records. This leads to high computational complexity, so these methods are not useful for large datasets. Some improvements to the distance-based algorithms exist, such as that of Knorr et al. [5], who explained that the fraction of dataset records lying close to an outlier must be less than some threshold value. Even so, the complexity remains exponential in the number of nearest neighbours.

Density Based
These methods estimate the density of the data and identify outliers as the records lying in regions of low density. Breunig et al. calculated a local outlier factor (LOF) to identify whether or not an object has sufficient neighbours around it [6]. They declare a record an outlier when its LOF exceeds a user-defined threshold. Papadimitriou et al. presented a similar technique called Local Correlation Integral, which deals with selecting the minimum number of points (MinPts) in LOF through statistical methods [7]. An advantage of the density-based methods is that they can detect outliers that are missed by techniques using a single, global criterion.

The terminology used in this paper includes the minimum support of a frequent itemset and Support(I), the support of itemset I.

Algorithms

Greedy algorithm
If a dataset contains outliers, it deviates from its original behavior and gives wrong results in any analysis. The Greedy algorithm proposed the idea of finding a small subset of the data records whose removal eliminates most of the disturbance of the dataset. This disturbance is also called entropy or uncertainty. Formally, let D be a dataset with m attributes A_1, A_2, ..., A_m, and let d(A_i) be the domain of distinct values of attribute A_i. The entropy of a single attribute A_j is

$$E(A_j) = -\sum_{x \in d(A_j)} p(x) \log p(x)$$

where p(x) is the relative frequency of value x in A_j. Because all attributes are assumed to be independent of each other, the entropy of the entire dataset D = {A_1, A_2, ..., A_m} is equal to the sum of the entropies of the m attributes:

$$E(D) = E(A_1) + E(A_2) + \dots + E(A_m)$$

The Greedy algorithm takes the number of outliers k as input [2]. All records in the set are initially designated as non-outliers. The frequencies of all attribute values are computed first, and the initial entropy of the dataset is calculated from them. The algorithm then scans the data k times to determine the top k outliers, setting aside one non-outlier on each scan. During a scan, every non-outlier is temporarily removed from the dataset once and the total entropy of the remaining dataset is recalculated. The non-outlier whose removal results in the maximum decrease in the entropy of the remaining dataset is declared an outlier and removed by the algorithm. The complexity of the Greedy algorithm is O(k * n * m * d), where k is the required number of outliers, n is the number of objects in the dataset D, m is the number of attributes in D, and d is the number of distinct attribute values per attribute.
Pseudo code for the Greedy algorithm is as follows:

Algorithm: Greedy
Input: Dataset D
       Target number of outliers k
Output: k detected outliers
Label all data points x_1, x_2, ..., x_n as non-outliers
Calculate the initial frequency of each attribute value and update the hash table

However, the entropy-based approach needs k as input, and the search has to be repeated with different numbers of outliers to reach optimal accuracy for any classification model.
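The entropy-based selection described above can be illustrated with a minimal Python sketch. It is not the implementation from [2]: the function names `dataset_entropy` and `greedy_outliers` are hypothetical, base-2 logarithms are an assumption, and the entropy is recomputed from scratch for every candidate instead of being updated through hash tables, so the sketch is slower than the O(k * n * m * d) bound quoted in the text.

```python
from collections import Counter
from math import log2


def dataset_entropy(records):
    """Sum of per-attribute Shannon entropies (attributes assumed independent)."""
    n = len(records)
    if n == 0:
        return 0.0
    total = 0.0
    for column in zip(*records):          # one column per attribute
        counts = Counter(column)
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total


def greedy_outliers(records, k):
    """Return the indices of k records picked greedily by entropy reduction.

    Naive illustration: the entropy of the remaining data is recomputed
    for every candidate removal (assumption: small datasets only).
    """
    k = min(k, len(records))
    remaining = list(range(len(records)))  # all points start as non-outliers
    outliers = []
    for _ in range(k):                     # one scan per requested outlier
        best_idx, best_entropy = None, float("inf")
        for idx in remaining:
            # Temporarily remove one non-outlier and re-evaluate the entropy.
            rest = [records[j] for j in remaining if j != idx]
            e = dataset_entropy(rest)
            if e < best_entropy:           # lowest remaining entropy = largest decrease
                best_idx, best_entropy = idx, e
        outliers.append(best_idx)
        remaining.remove(best_idx)
    return outliers
```

For example, in a toy dataset of five identical records plus one record with different values, `greedy_outliers(records, 1)` selects the deviating record, because removing it leaves a dataset with zero entropy.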

AVF algorithm
The algorithm discussed above is linear with respect to the data size, but it needs k scans over the data. Other models, based on frequent itemset mining (FIM), need a large space to store itemsets and must then search for these sets in every data point. These techniques can become very slow when a low threshold value is selected for finding frequent itemsets in the dataset. A simpler and faster approach, which minimizes the scans over the data and needs neither the extra space nor the search for combinations of attribute values or itemsets, is the Attribute Value Frequency (AVF) algorithm. An outlier point x_i is defined based on the AVF score:

$$\mathrm{AVFScore}(x_i) = \frac{1}{m} \sum_{l=1}^{m} f(x_{il})$$

where f(x_{il}) is the frequency of the l-th attribute value of x_i in the dataset. The k points with the lowest scores are reported as outliers. In this approach [1] we again need to search for k outliers many times to reach optimal accuracy for any classification model. The complexity of the AVF algorithm is lower than that of the Greedy algorithm, since AVF needs only one scan to detect outliers: it is O(n * m). It needs the value of k as input. FPOF [8] is a frequent-pattern-based method for finding infrequent objects; it also requires the value of k, together with another parameter, σ, as a threshold.
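As an illustration of the AVF idea, the following Python sketch scores each record by the average frequency of its attribute values and returns the k lowest-scoring records. The function names `avf_scores` and `avf_outliers` are hypothetical, and raw counts are used instead of probabilities (an assumption; dividing every count by n would not change the ranking).

```python
from collections import Counter


def avf_scores(records):
    """Average attribute-value frequency of every record."""
    m = len(records[0])                                # number of attributes
    counts = [Counter(col) for col in zip(*records)]   # one frequency table per attribute
    return [sum(counts[l][rec[l]] for l in range(m)) / m for rec in records]


def avf_outliers(records, k):
    """Indices of the k records with the lowest AVF score."""
    scores = avf_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:k]
```

With the four records `("a","x")`, `("a","x")`, `("a","y")`, `("b","z")`, the scores are 2.5, 2.5, 2.0 and 1.0, so the last record is ranked as the strongest outlier candidate.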

NAVF algorithm
The proposed model (NAVF) determines an optimal number of outliers in a single run, so that a classification model built on the data reaches good precision with a low recall value. The method calculates the value of k itself, based on the frequencies. Let D be a dataset with m attributes A_1, A_2, ..., A_m, and let d(A_i) be the domain of distinct values of attribute A_i. k_N is the number of outliers obtained under the assumption of a normal distribution; to obtain k_N the model uses Gaussian theory. If the frequency score of an object is less than "mean - 3 S.D.", the model treats that object as an outlier. The method uses the AVF score formula to compute the scores, but no value of k is required.

Let D be the categorical dataset containing n data points x_i, where i = 1...n, each with m attributes. The procedure computes the frequency score F_i of every data point x_i with the AVF score formula and sets the threshold a to the mean of the scores minus three standard deviations. Its final steps are:
Step 6: If F_i < a, then declare x_i an outlier.
Step 7: Return the k_N detected outliers.
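The thresholding rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MATLAB implementation used in the experiments: the function name `navf_outliers` and the parameter `n_sd` are hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed because the text does not say which estimator is used.

```python
from collections import Counter
from statistics import mean, pstdev


def navf_outliers(records, n_sd=3.0):
    """Flag records whose AVF score lies below mean - n_sd * S.D.

    Returns (outlier_indices, threshold). No k is supplied; the number
    of outliers k_N is simply len(outlier_indices).
    """
    m = len(records[0])                                # number of attributes
    counts = [Counter(col) for col in zip(*records)]   # per-attribute frequencies
    scores = [sum(counts[l][rec[l]] for l in range(m)) / m for rec in records]
    # Assumption: population standard deviation of the scores.
    threshold = mean(scores) - n_sd * pstdev(scores)
    outliers = [i for i, s in enumerate(scores) if s < threshold]
    return outliers, threshold
```

For instance, with 99 identical records and one record whose values appear nowhere else, the scores are 99 and 1, the threshold is roughly 68.8, and only the deviating record is returned. A dataset in which no score falls below the threshold yields an empty list, that is, k_N = 0.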

Experimental Results
In this paper the model has been applied to the Breast Cancer, Nursery and Bank Marketing datasets from the UCI Machine Learning Repository [9]. The approach was implemented in MATLAB. We ran our experiments on a workstation with a Pentium(R) D 2.80 GHz processor and 1.24 GB of RAM. The Nursery data consists of nine attributes and 6236 records. The data was divided into two parts based on the parent attribute: the first part contains 4320 records with the usual parent type, and the second part contains 1916 records with the pretentious parent type, which are used as outliers in our experiment. In the first iteration 956 sample records were selected randomly using the Clementine tool, one from every two records. These 956 records were mixed with part one, and the normally distributed AVF was applied to find outliers. The outliers found are given in Table 2. Similarly, in the next iteration 382 records were selected randomly, one from every five records, mixed with the first part, and the same process was applied. The results are given in Table 2.
The same process was repeated with one record selected from every eight records and from every ten records. The method was compared across the different numbers of outliers in each sample; the comparison graph is given in Figure 3. For the Nursery data, Figure 1 shows the frequency of the different attribute values and their structure. Figure 2 shows the outliers of the Nursery data, which appear in red. These red-coloured points lie under the "Mean-3SD" line, as can be observed in Figure 2. Figure 2 was drawn in MATLAB for the data taken by the 1-in-2 sampling method.
In the first Nursery sample, the NAVF model found only 4.60% of the 956 outliers that were mixed with the 4320 records, for a total of 5276 records. In the next sample of 382 records, 34.8% of the outliers were found correctly by NAVF. For the sample of 238 records, NAVF found 239 outliers, of which 238 are correct, which means that the NAVF model found 100% of the outliers. Similarly, the NAVF model found 100% of the outliers in the sample of 190 records (as outliers) mixed with the 4320 records of part one. In the case of the breast cancer dataset, the proportion of correct outliers found by the NAVF model did not reach 100%. In the breast cancer data, 119, 48, 29 and 23 outliers were selected using 1-in-2, 1-in-5, 1-in-8 and 1-in-10 sampling respectively from benign breast cancer. NAVF found 35, 9, 9 and 14 correct and 0, 0, 0 and 1 wrong outliers from the 119, 48, 29 and 23 outliers. The results are given in Table 3. The comparison of outlier detection is shown in Figure 4. In the Bank Marketing data, only categorical attributes were selected, and 2644, 1027, 661 and 528 outliers were selected using 1-in-2, 1-in-5, 1-in-8 and 1-in-10 sampling respectively. The results are given in Table 4 and its graph is given in Figure 5.

Table 5. Classifiers Results on Bank Data

Conclusion and Future Work
To sum up, the proposed method gives the optimal number of outliers, k_N. In existing models it is mandatory to supply the number of outliers in order to find them. When the number of outliers is chosen in this way, part of the original data may sometimes be lost, and any classifier built on such data may be modelled wrongly. In future, the precision and recall values of this model can be checked against those of the existing models. The same method can also be applied to datasets of mixed type.