A Robust Scalable Model Using Hybrid Approach for the Detection of the Projected Outliers

Abnormal and anomalous observations, even in an era of advanced technology, can severely disrupt the industries they affect. To reduce and eliminate outliers from massive data streams, they must first be identified accurately in high-dimensional data, which is itself very challenging. In this study, a scalable outlier detection model is proposed that is robust enough to detect projected outliers, i.e., outliers that lie in lower-dimensional subspaces. The model addresses the curse of dimensionality, which arises frequently in large data streams and massive datasets. Fast distance-based and density-based approaches are applied first, and the probability density is then estimated with a Gaussian Mixture Model. Finally, Bayes' probability is applied to the resulting observations to confirm them as projected outliers.


INTRODUCTION
Outlier detection aims at finding peculiar, markedly different data items in massive datasets, data streams and databases, and is hence an important aspect of data mining. Detecting outliers is comparatively straightforward in low-dimensional data, but the curse of dimensionality arises in high-dimensional datasets: as the dimensionality grows, outliers become less prominent because they are embedded in lower-dimensional subspaces. As massive amounts of data and data streams are stored in large databases and datasets, an effective and efficient methodology is needed to analyze them and to make the information implicitly contained in the data useful.
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable knowledge in data. Research on KDD usually focuses on finding patterns that apply to a considerable portion of a dataset. However, applications such as fraud detection, network sensor monitoring, spam detection and intrusion detection require detecting the rare observations that do not follow such patterns. Finding the anomalous observations among the data points is the basic notion of outlier detection, but an outlier is not always unwanted noise; sometimes it is the observation of greatest interest. Whether an event is an outlier depends on the situation: if a data item with a probability of occurrence of 0.0001 occurs, it is an outlier, and if a data item with a probability of occurrence of 0.998 does not occur, it is again an outlier. Thus, both the probability density and the likelihood of a data item determine its outlierness. Many existing methodologies find outliers in low-dimensional data, but as the dimensionality grows, the data points become almost equidistant from each other, so the projected outliers, which are visible only in lower-dimensional subspaces, are hidden and are not highlighted. This problem is usually known as the curse of dimensionality.
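The statement that data points become almost equidistant in high dimensions can be illustrated with a small numerical sketch. The code below is not part of the proposed model; it assumes points drawn uniformly from the unit hypercube, and the function name `relative_contrast` is a hypothetical helper introduced only for this illustration. It measures the relative contrast (d_max - d_min) / d_min of the distances from one query point to all other points.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for distances from one query point.

    Points are drawn uniformly from the unit hypercube (an assumption made
    for this illustration). A value near zero means that all points are
    almost equidistant from the query point.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(200, dims), 3))
```

Under these assumptions the contrast is large in two dimensions and shrinks towards zero as the number of dimensions grows, which is why a distance threshold chosen in the full space no longer separates outliers from normal points.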
One solution to this problem is SPOT, i.e., the Stream Projected Outlier Detection model (Zhang, 2009), which employs a new window-based time model and decaying data summaries to capture statistics from data streams for outlier detection. Finding projected outliers in high-dimensional data is quite challenging, so it is of utmost importance to build a robust model with the following characteristics:
• Scalable: The system performance should not degrade substantially as the number of sequences grows. Space and time complexity should therefore be optimized.
• Unique: The model should find a unique solution.
• Comprehensible: If the model highlights a set of projected outliers, it must explain why those particular data objects are termed anomalous.
• Robust: The performance of the model should not depend completely on input provided by experts. The initial solution should be stable and able to handle adverse conditions.
• Repeatable: The model should provide a repeatable solution. Each time the system is run, it should identify exactly the same number of outliers.
Clustering plays a major role in detecting projected outliers because clusters and anomalies complement each other: observations that are not included in a cluster are treated as noise, exceptions or novel data items with respect to that cluster. Outliers, which may sometimes be regarded as noise, can have an adverse effect on the reliability of clusters, so it is desirable to remove them in order to obtain effective and efficient clusters. At the same time, outliers cannot be regarded merely as a byproduct of clustering, although some researchers take this view; this is why much of the existing research emphasizes clustering rather than outlier detection and analysis. Clustering is nevertheless a very important technique for anomaly detection. The popular clustering algorithms in the context of KDD, namely CLARANS (Ng and Han, 1994), DBSCAN (Ester et al., 1996), BIRCH (Zhang et al., 1996), STING (Wang et al., 1997), WaveCluster (Sheikholeslami et al., 1998), DENCLUE (Hinneburg and Keim, 1998) and CLIQUE (Agrawal et al., 1998), are to some extent helpful in detecting anomalous observations, but their main objective is clustering rather than outlier detection. A model is therefore required that detects outliers in high-dimensional data effectively and efficiently. In this study, we build a robust, scalable clustering model using a hybrid approach and evaluate it against various parameters to establish its validity.

LITERATURE REVIEW
Many outlier detection methodologies have already been proposed. Distance-based outlier detection techniques were first introduced by Knorr and Ng (1998). According to them, "An object p in the data set DS is a DB(q, dist)-outlier if at least a fraction q of the objects in DS lie at a distance greater than dist from p." This definition can be generalized and used in various statistical outlier detection methodologies. Ramaswamy et al. (2000) extended it by assigning a ranking, or outlier score, to all data points. According to their definition, given two integers k and w, an object p is an outlier if fewer than w objects have a higher value of D^k than p, where D^k denotes the distance from an object to its kth nearest neighbor. Angiulli et al. (2006) proposed considering the whole neighborhood of an object: all points are ranked by the sum of the distances to their k nearest neighbors rather than by the distance to the kth nearest neighbor alone. Breunig et al. (2000) proposed the Local Outlier Factor (LOF), which quantifies how outlying each data object in a data set is by indicating its degree of outlierness. Zhang et al. (1996) proposed the Local Distance-based Outlier Factor (LDOF); the LDOF of an object measures the degree to which the object deviates from its neighborhood. Clustering methods such as CLARANS, DBSCAN, BIRCH and CURE can also be used to detect outliers. Moreover, Aggarwal and Yu (2000) studied the impact of high dimensionality on distance-based outlier detection, since most of these approaches are less meaningful in high-dimensional data. Lazarevic et al. (2013) proposed a feature bagging approach to handle high dimensionality, in which multiple outlier detectors, each built on a randomly selected subset of features, are combined. The statistical literature (Rousseeuw and Leroy, 1987; Rocke and Woodruff, 1996; Atkinson, 1994) includes various outlier detection methods based on the Mahalanobis distance (MD), which use robust estimates of the scatter matrix and location to provide robustness. Rousseeuw and Van Driessen (1999) made robust MD-based methods feasible for data with large sample sizes. Danuser and Stricker (1998) provided a framework for least-squares fitting of multiple parametric models. Fidler et al. (2006) introduced a classification algorithm for projected outliers that is built on a robust dimensionality reduction technique. Takeuchi and Yamanishi (2006) proposed an approach for time series based on an autoregression model. Pearson was the first to study a Gaussian mixture with two components and considered methods for estimating the mixture parameters; Dempster et al. (1977) later provided the theoretical framework of the EM algorithm. Hasselblad (1966), Day (1969), Wolfe (1970) and Duda and Hart (1973) also provided detailed treatments of mixture densities and the EM algorithm. Roeder and Wasserman (1997) gave a detailed illustration of the Bayesian perspective on density estimation using Gaussian mixtures. Richardson and Green (1997) extended Bayesian sampling to cases where the number of Gaussian components is unknown.
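The two ranking schemes mentioned in this review, ranking by the distance to the kth nearest neighbor and ranking by the sum of the distances to the k nearest neighbors, can be contrasted with a minimal brute-force sketch. The function names and the toy data are illustrative assumptions and are not taken from the cited works, which use far more efficient algorithms.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

# Hypothetical toy data: a tight group of five points and one distant point.
TOY: List[Point] = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]


def knn_distances(data: List[Point], index: int, k: int) -> List[float]:
    """Ascending distances from data[index] to its k nearest neighbours."""
    dists = sorted(math.dist(data[index], p)
                   for j, p in enumerate(data) if j != index)
    return dists[:k]


def rank_by_kth_distance(data: List[Point], k: int, top: int) -> List[int]:
    """Indices of the `top` objects with the largest kth-neighbour distance."""
    scores = [knn_distances(data, i, k)[-1] for i in range(len(data))]
    return sorted(range(len(data)), key=lambda i: -scores[i])[:top]


def rank_by_distance_sum(data: List[Point], k: int, top: int) -> List[int]:
    """Indices of the `top` objects with the largest sum of k-NN distances."""
    scores = [sum(knn_distances(data, i, k)) for i in range(len(data))]
    return sorted(range(len(data)), key=lambda i: -scores[i])[:top]


if __name__ == "__main__":
    print(rank_by_kth_distance(TOY, k=2, top=1))
    print(rank_by_distance_sum(TOY, k=2, top=1))
```

On this toy data both schemes rank the distant point first; they can differ on data where an object has one close neighbor but a sparse wider neighborhood.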

AN OVERVIEW OF THE PROPOSED APPROACH
The proposed method was designed after completing the literature survey. It uses the basic notion of distance-based outlier detection and then applies density-based outlier detection to determine the degree of outlierness. To make the model robust, a Gaussian Mixture Model is used to calculate the probability density function of the data objects, and finally Bayes' probability is used to confirm the outliers as projected outliers.

Distance based outlier detection method:
The modern distance-based approach rests on a simple notion: a data point is an outlier if its locality is sparsely populated (Lazarevic et al., 2013). It is an important outlier detection method that is distribution-free and easily applicable to various types of data. According to the distance-based approach, given a dataset X, an object x ∈ X is a DB(α, δ)-outlier if:

$$ |\{x' \in X : d(x, x') > \delta\}| \geq \alpha n $$

Here, n = |X| is the number of objects, d is the distance function, and α, δ ∈ R (0 ≤ α ≤ 1) are parameters. Distance-based methods have two drawbacks:
• Setting the distance threshold δ is quite difficult in practice, whereas setting α is not so difficult because it is always close to 1.
• Outliers are not ranked, because the method always gives a binary result: a data object either is an outlier or is not.
A ranking is provided by the kth-nearest-neighbor (KNN) score, $q_{k\text{thNN}}(x) := d^{k}(x; X)$, where $d^{k}(x; X)$ is the distance between x and its kth nearest neighbor in X. The kth-nearest-neighbor approach in turn has two drawbacks:
o Scalability: the cost is O(n^2); the solution is to compute the pairwise distances only partially, obtaining scores only for the top t outliers.
o Detection ability: the solution is to introduce a degree of outlierness by switching to density-based methods.
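The DB(α, δ) definition can be checked directly by counting, for each object, how many objects lie farther away than δ. The following is a minimal brute-force sketch, assuming Euclidean distance; the function names and the toy data are hypothetical and serve only to illustrate the binary nature of the result.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

# Hypothetical toy data: five clustered points and one distant point.
TOY: List[Point] = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]


def is_db_outlier(data: List[Point], index: int,
                  alpha: float, delta: float) -> bool:
    """True if at least alpha * n objects lie farther than delta from data[index]."""
    n = len(data)
    far = sum(1 for p in data if math.dist(data[index], p) > delta)
    return far >= alpha * n


def db_outliers(data: List[Point], alpha: float, delta: float) -> List[int]:
    """Indices of all DB(alpha, delta)-outliers, found in O(n^2) time."""
    return [i for i in range(len(data)) if is_db_outlier(data, i, alpha, delta)]


if __name__ == "__main__":
    print(db_outliers(TOY, alpha=0.8, delta=3.0))
```

The result is a plain list of flagged objects with no ordering among them, which is the second drawback noted above, and the double loop over all pairs is the source of the O(n^2) cost.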

Density based outlier detection method:
The basic idea of density-based outlier detection is to compare the density around a point with the densities around its local neighbors; the resulting relative density is the outlier score. The general assumption is that the density around a normal object is similar to the densities around its neighbors, whereas the density around an outlier is quite different from the densities around its neighbors. The Local Outlier Factor (LOF) is based on pairwise distances and is a very prominent outlier detection method. The LOF score $q_{LOF}(x)$ is defined as the ratio of the average local reachability density of the k nearest neighbors of x to the local reachability density of x. Let $N_k(x)$ denote the set of the k nearest neighbors of x in X. The reachability distance is:

$$ Rd(x, x') := \max\{d^{k}(x'; X),\; d(x, x')\} $$

The local reachability density is:

$$ lrd(x) := \left( \frac{1}{|N_k(x)|} \sum_{x' \in N_k(x)} Rd(x, x') \right)^{-1} $$

The LOF is defined as:

$$ q_{LOF}(x) := \frac{\frac{1}{|N_k(x)|} \sum_{x' \in N_k(x)} lrd(x')}{lrd(x)} $$

LOF ≈ 1 implies that the point lies in a cluster, i.e., a region of homogeneous density around the point and its neighbors, whereas LOF >> 1 implies that the point is an outlier. Since interest mostly lies in the top n outliers, LOF need not be computed for all data points, which reduces the overall run time.
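The LOF computation can be traced step by step with a brute-force sketch. It assumes Euclidean distance, no duplicate points, and exactly k neighbors per object (ties are broken by index, whereas the original LOF definition keeps all tied neighbors). The helper names and the grid data are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

# Hypothetical data: a regular 3 x 3 grid (indices 0..8) and one distant point.
DATA: List[Point] = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]


def k_neighbours(data: List[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours of data[i], ties broken by index."""
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: math.dist(data[i], data[j]))
    return others[:k]


def k_distance(data: List[Point], i: int, k: int) -> float:
    """Distance from data[i] to its kth nearest neighbour."""
    return math.dist(data[i], data[k_neighbours(data, i, k)[-1]])


def reach_dist(data: List[Point], i: int, j: int, k: int) -> float:
    """Reachability distance: max of the k-distance of j and d(i, j)."""
    return max(k_distance(data, j, k), math.dist(data[i], data[j]))


def lrd(data: List[Point], i: int, k: int) -> float:
    """Local reachability density: inverse mean reachability distance."""
    nbrs = k_neighbours(data, i, k)
    return len(nbrs) / sum(reach_dist(data, i, j, k) for j in nbrs)


def lof(data: List[Point], i: int, k: int) -> float:
    """Average lrd of the neighbours of data[i] divided by lrd of data[i]."""
    nbrs = k_neighbours(data, i, k)
    mean_neighbour_lrd = sum(lrd(data, j, k) for j in nbrs) / len(nbrs)
    return mean_neighbour_lrd / lrd(data, i, k)


if __name__ == "__main__":
    for i in (4, 9):  # grid centre and the distant point
        print(DATA[i], round(lof(DATA, i, k=3), 2))
```

In this example the distant point receives a score far above 1, while the grid centre receives a score below 1; the sketch recomputes neighborhoods repeatedly and is meant for reading, not for large datasets.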
Gaussian mixture model: The Gaussian Mixture Model (GMM) is a robust technique for detecting anomalous observations and is generally used for multimodal data distributions. The GMM parameters are estimated with the Expectation Maximization (EM) algorithm, and the model is also applied effectively to clustering and classification. GMMs were initially used for structural damage detection and were later combined with the Mahalanobis distance by Nair, where they also proved useful in damage detection. A GMM consists of the means, the covariances and a probabilistic assignment of every data point to the Gaussians. It is a weighted sum of K Gaussian densities of the form:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) $$

Every Gaussian component $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ of the mixture has its own mean $\mu_k$ and covariance $\Sigma_k$. The parameters $\pi_k$ are called mixing coefficients and satisfy:

$$ 0 \leq \pi_k \leq 1, \qquad \sum_{k=1}^{K} \pi_k = 1 \qquad (5) $$

The parameters of the Gaussian mixture distribution are chosen to maximize the log of the likelihood function (Dempster et al., 1977) using the EM framework, which consists of the following steps:
• Initialization: The GMM parameters, i.e., the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, are initialized.
• E-Step (Expectation): The responsibilities $\gamma(z_{nk})$ are evaluated using the most recent values of the GMM parameters:

$$ \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} $$

where $z_{nk}$ is an element of the K-dimensional binary random variable z for the nth learning pattern.

M-Step (Maximization Step):
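In this step, the weighted means and covariances are recomputed, together with the mixing coefficients:

$$ \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n $$

$$ \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^{T} $$

$$ \pi_k^{new} = \frac{N_k}{N} $$

Here, $\mu_k^{new}$ is a d-dimensional vector and $\Sigma_k^{new}$ is a d × d covariance matrix, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$ and N is the number of learning patterns. Figure 1 illustrates this. These steps are iterated until convergence.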
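The alternation of the E-step and the M-step can be followed in a small sketch for one-dimensional data, where each covariance reduces to a scalar variance. This is an illustration under stated assumptions, not the implementation used in the experiments: the initial parameters are supplied by the caller, a variance floor of 1e-6 is added to avoid degenerate components, and the function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical one-dimensional sample with two well-separated groups.
XS = [-0.5, 0.0, 0.3, 0.5, -0.2, 9.6, 10.0, 10.4, 10.1, 9.8]


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, means, variances, weights) -> float:
    """Sum over the data of the log of the mixture density."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for m, v, w in zip(means, variances, weights)))
        for x in xs
    )


def em_gmm_1d(xs: Sequence[float], means: Sequence[float],
              variances: Sequence[float], weights: Sequence[float],
              tol: float = 1e-8, max_iter: int = 200
              ) -> Tuple[List[float], List[float], List[float], float]:
    """Run EM until the log-likelihood improves by less than `tol`."""
    means, variances, weights = list(means), list(variances), list(weights)
    previous = current = log_likelihood(xs, means, variances, weights)
    for _ in range(max_iter):
        # E-step: responsibility of every component for every observation.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v)
                     for m, v, w in zip(means, variances, weights)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate means, variances and mixing coefficients.
        for k in range(len(means)):
            n_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / n_k
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / n_k
            variances[k] = max(var, 1e-6)  # floor: an assumption of this sketch
            weights[k] = n_k / len(xs)
        current = log_likelihood(xs, means, variances, weights)
        if abs(current - previous) < tol:
            break
        previous = current
    return means, variances, weights, current


if __name__ == "__main__":
    print(em_gmm_1d(XS, means=[1.0, 8.0], variances=[1.0, 1.0], weights=[0.5, 0.5]))
```

With the hypothetical sample above, the two estimated means settle close to the two group averages and the mixing coefficients sum to one. A production implementation would work with log-densities to avoid numerical underflow for observations far from every component.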
Bayes' probability: Suppose X = (x_1, x_2, x_3, ..., x_n) denotes a set of N observations from a d-dimensional space, and suppose there are two classes, the outlier class O and the normal class N. Let M_0 = {m_1, m_2, ..., m_n} be the corresponding outlier scores of the high-dimensional projected outliers assigned to X. Without loss of generality, assume that the higher the value of p_i, the more likely x_i is an outlier, where p_i = P(O | m_i).
The probability that x_i is normal is then P(N | m_i) = 1 - p_i. According to Bayes' theorem:

$$ P(O \mid m_i) = \frac{P(m_i \mid O)\, P(O)}{P(m_i \mid O)\, P(O) + P(m_i \mid N)\, P(N)} $$

Embedded approach and exact procedure:
• Take real or synthetic data commonly used for clustering from the UCI repository.
• Form absolute clusters using the K-means clustering algorithm, and resolve the cluster validation problem by applying an appropriate threshold.
• Apply the distance-based outlier detection technique to these absolute clusters: the distance of each data item from its cluster centroid is calculated first, followed by its distance from its k nearest neighbors.
• Perform attribute relevance analysis to find the projected clusters. Irrelevant attributes contain noise or outliers, also termed sparse data points, whereas relevant attributes exhibit some cluster structure (Agrawal et al., 1998). The higher the density of points in a region relative to its surroundings, the more highly clustered the region is. To investigate the densely populated regions, the degree of sparseness is computed, so the density-based outlier method is applied. The local outlier factor is then calculated for the projected clusters with the notion: LOF ≈ 1 {the data item is a normal object}; LOF >> 1 {the data item is an outlier}.
• After the distance-based and density-based methods, apply the Gaussian Mixture Model to the output of both in order to calculate the probability density function. To do so, the M-step re-estimates the GMM parameters using the up-to-date values of the responsibilities $\gamma(z_{nk})$. The free parameters of the Gaussian mixture model are the means and covariance matrices of the Gaussian components and the weights indicating the contribution of each Gaussian to the approximation of P(X | C_j). The log-likelihood function is given by:

$$ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) $$

This step helps to find the projected outliers.
• Apply Bayes' probability to the output obtained from the Gaussian Mixture Model. Bayes' probability relates the posterior density functions to the current probability density function to check the likelihood of a point; if there is some chance, however small, that a point can be categorized as normal or included in a cluster, the point is pruned from the set of outlier candidates. In this way, the final output is a refined and exact stream of projected outliers (Fig. 2).
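The final confirmation step can be sketched as a two-class posterior computed with Bayes' theorem. The sketch assumes that the class-conditional densities of the outlier score are univariate Gaussians and that the prior P(O) is known; all numerical values, the 0.5 decision threshold and the function names are hypothetical choices made for illustration and are not taken from the experiments.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_outlier(score: float, prior_outlier: float,
                      mean_o: float, var_o: float,
                      mean_n: float, var_n: float) -> float:
    """P(O | score) from Bayes' theorem with two Gaussian class densities."""
    joint_o = gaussian_pdf(score, mean_o, var_o) * prior_outlier
    joint_n = gaussian_pdf(score, mean_n, var_n) * (1.0 - prior_outlier)
    return joint_o / (joint_o + joint_n)


def confirm_outliers(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Indices of candidates whose posterior outlier probability reaches `threshold`.

    Hypothetical model: normal scores ~ N(1.0, 0.04), outlier scores ~ N(3.0, 1.0),
    prior P(O) = 0.05. Candidates below the threshold are pruned.
    """
    return [
        i for i, s in enumerate(scores)
        if posterior_outlier(s, 0.05, 3.0, 1.0, 1.0, 0.04) >= threshold
    ]


if __name__ == "__main__":
    print(confirm_outliers([0.9, 1.0, 1.1, 3.2, 1.05]))
```

With these assumed densities, a candidate whose score is close to 1 is pruned even though it reached this stage, whereas a candidate with a score near 3 is retained; a stricter threshold than 0.5 would prune more aggressively.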

RESULTS AND DISCUSSION
Initial results: First, absolute clusters are formed using the K-means clustering algorithm, and distance-based outlier detection is applied to detect the outliers. The cluster validation problem is then resolved by choosing an exact threshold value. This threshold should suit the particular clusters so that neither too many outliers are highlighted nor true outliers remain hidden. Second, the density-based approach exposes further high-dimensional outliers, and the clusters are validated again by choosing an appropriate radius value. The final projected outliers are highlighted after estimating the probability density functions and conditional probabilities using the Gaussian Mixture Model and Bayes' probability, respectively. The probability density function performs the attribute relevance analysis, and Bayes' probability converges the attributes by applying a conditional check to highlight the projected outliers. The experiment is carried out on a real dataset, the Iris dataset from the UCI Repository, a multivariate dataset with 150 instances and 4 attributes. The work is first done on only two attributes, sepal length and sepal width, and is then extended to the other two attributes, petal length and petal width, to test the scalability of our model. To obtain accurate results, the absolute clusters are formed by validating the threshold value of the cluster, and this threshold is tested under various values of the radius. The log-likelihood value is then estimated using the expectation step of the Gaussian mixture, and finally the number of projected outliers and the time elapsed in detecting them are shown in Table 1. Figures 3 and 4, obtained by executing the model in MATLAB, show the clusters with the outliers highlighted. In these two figures the clusters are shown at threshold values of 80% and 90%, at their extreme radius points, so as to resolve the cluster validity problem and obtain the absolute clusters.
The attribute relevance analysis helps in converging the large number of data points within the clusters, which are marked as circles in Fig. 3 and 4, while the outliers of high projection are marked as small crosses.
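The first stage of the experiment, K-means clustering followed by a cut-off on the distance to the cluster centroid, can be sketched as follows. The text does not define precisely how the 80% and 90% thresholds are applied, so the sketch assumes that the threshold is a quantile of the centroid distances within each cluster; this interpretation, the function names, the initial centroids and the toy data are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

# Hypothetical data: two groups, with (4, 4) lying away from the first group.
POINTS: List[Point] = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (4.0, 4.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
]


def assign(data: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Label every point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in data]


def kmeans(data: Sequence[Point], centroids: Sequence[Point],
           n_iter: int = 20) -> Tuple[List[Point], List[int]]:
    """Lloyd's algorithm started from caller-supplied centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        labels = assign(data, centroids)
        for c in range(len(centroids)):
            members = [p for p, l in zip(data, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assign(data, centroids)


def flag_by_quantile(data: Sequence[Point], centroids: Sequence[Point],
                     labels: Sequence[int], fraction: float) -> List[int]:
    """Flag points lying beyond the `fraction` quantile radius of their cluster."""
    dists = [math.dist(p, centroids[l]) for p, l in zip(data, labels)]
    flagged: List[int] = []
    for c in range(len(centroids)):
        member_d = sorted(d for d, l in zip(dists, labels) if l == c)
        if not member_d:
            continue
        radius = member_d[min(len(member_d) - 1, int(fraction * len(member_d)))]
        flagged += [i for i, (d, l) in enumerate(zip(dists, labels))
                    if l == c and d > radius]
    return sorted(flagged)


if __name__ == "__main__":
    cents, labs = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
    print(flag_by_quantile(POINTS, cents, labs, 0.8))
    print(flag_by_quantile(POINTS, cents, labs, 0.9))
```

Under this assumed interpretation, raising the threshold enlarges the cluster radius and reduces the number of flagged candidates, which matches the requirement that the threshold should neither highlight too many outliers nor hide true ones.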
Performance evaluation: Our main motive is to construct a scalable model that is robust enough to handle a large number of outliers. Various parameters are evaluated for the results shown above (Fig. 5 and 6).
Scalability: When the model is scaled to more dimensions, the performance does not degrade, as the figure shows. This figure gives the results when two more attributes are included in the dataset. The number of projected outliers also increases at both the 80% and 90% threshold values. The performance is not degraded, as the elapsed time remains comparatively small.

Uniqueness:
In Tables 1 and 2, every time the model is executed it provides a unique number of outliers. At different thresholds and different radii, a varying number of outliers is shown.

Time complexity and space complexity:
The time elapsed in detecting the projected outliers is calculated to be less than the order of O(n^2), and the CPU burst time and the self time are calculated to be 0.48 and 0.16 milliseconds. The individual elapsed times for all cases are shown in Fig. 7. Usually, in high-dimensional data, the time increases with the dimensionality and degrades the system's performance to a large extent. In our case, although the time increases when the dimensionality is increased (Table 2), the performance is not degraded, as shown in Fig. 7. The unfiltered data after delay composition remain as they are; since they do not affect the amplitude, they do not affect the system's overall performance.
Curse of dimensionality: This is the problem where the high-dimensional or projected outliers are hidden in lower-dimensional subspaces, so that their prominence is overlooked, which has a major impact on the overall performance of the system. To resolve this problem, we compute the individual probability densities of the clusters using the Gaussian mixture and confirm their outlierness with conditional probabilities obtained from Bayes' probability. Hence, the projected outliers that are hidden in lower-dimensional subspaces are highlighted and become more prominent.

Robustness:
Scaling the model up does not compromise the quality or quantity of the detected outliers; moreover, it does not increase the time and space complexity, so the overall performance of the system remains unaffected. Hence the system is robust under severe conditions.

CONCLUSION
In this study, we proposed a robust and scalable model using a hybrid approach and clustering. After constructing the clusters using K-means, further methodologies, namely distance-based and density-based detection, the Gaussian Mixture Model and finally Bayes' probability, are applied. The quantity and quality of the outliers are checked at varying radii and thresholds, and the model is tested under various parameters such as scalability, robustness and time complexity, both initially and after the dimensionality is increased; the performance is not impacted.

Table 1: Number of projected outliers and elapsed time for various validated clusters

Table 2: Number of projected outliers and elapsed time for various validated clusters with enhanced dimensions

Fig. 7: Comparison of time with amplitude