Network Traffic Anomaly Detection Based on Incremental Possibilistic Clustering Algorithm

This paper proposes a Mahalanobis distance based Incremental Possibilistic Clustering (IPC) algorithm to detect abnormal flow. Firstly, the attributes of network flow are extracted by damped incremental statistics. Then the model of normal traffic is generated by the IPC algorithm. To extract a model of high-dimensional data without a pre-known number of cluster centers, the algorithm gradually chooses outliers as new cluster centers and merges overlapping cluster centers. Finally, data that does not belong to any normal model is regarded as abnormal. By using the Mahalanobis distance instead of the traditional Euclidean distance, the defect that possibilistic clustering tends to find hyperspherical clusters is overcome. Experiments show that this method can distinguish normal flow from abnormal flow effectively and reaches a detection rate of 98%.


Introduction
Due to the rapid development of internet technology and the popularity of Internet of Things (IoT) devices, network security has become a crucial issue in cyberspace. As early as 2014, there were more than 40 million security incidents on the internet, resulting in significant economic losses and leakage of personal privacy [1]. With the access of IoT devices, including mobile phones, IP cameras and other smart sensors, network traffic has become more and more complex. Although users' reliance on the network continues to increase, users still lack awareness of network security [2]. In this situation, a high-efficiency, low-consumption detection method is required for IoT devices to detect network intrusions. This paper proposes a Mahalanobis distance based Incremental Possibilistic Clustering (IPC) algorithm for anomaly detection. The attributes are extracted by damped incremental statistics instead of a sliding window to reduce memory consumption. IPC gradually selects outliers as new cluster centers, and uses possibilistic clustering to adjust and merge cluster centers. Finally, data that does not belong to any cluster is regarded as abnormal.

Related works
An anomaly detection model usually identifies attacks by generating a detection model from normal traffic, or from traffic combined with a small percentage of abnormal packets. The detection model divides unknown network packets into different categories; a packet that has the characteristics of attack traffic will be assigned to the abnormal-traffic category or will not fit into any category. Many researchers have proposed detection models based on statistics, machine learning, data mining, information theory or deep learning. [1] presents an in-depth analysis of detection techniques: it proposes a generic framework for network anomaly detection and lists detection models based on classification, statistics, information theory and clustering.
Machine learning is the most popular approach to anomaly detection. Damopoulos et al. evaluated four traditional machine-learning algorithms (Bayesian networks, radial basis function, K-nearest neighbors and random forest) [3] and reported a detection rate of 99.8%. Deep learning is the newest favored approach. Niyaz et al. designed a detection model using self-taught learning [4], and Meidan et al. use deep auto-encoders to detect anomalous network traffic: the auto-encoder encodes and then decodes each packet, and only normal packets are restored correctly [5].
Clustering, as a common data mining method, is also suitable for anomaly detection. The paper [6] uses K-means, k-medoid, EM clustering and KNN algorithms to detect unknown attacks. The results show that their detection accuracy is high, while the anomaly detection module produces a high false positive rate. Xie et al. designed an anomaly detection system based on fuzzy c-means [7], but this method cannot deal with high-dimensional data and cannot find non-convex cluster structures.
Naive clustering algorithm
K-means is a common clustering algorithm that minimizes the within-class dispersion by repeatedly reassigning class members in order to obtain the best clustering result. In K-means clustering, each sample belongs to the nearest cluster center. This method is time-saving and converges rapidly, but the clustering result is often unsatisfactory [8]. Based on Zadeh's fuzzy theory, Bezdek proposed the fuzzy C-means clustering algorithm (FCM) [9]. This method extends the membership degree of K-means clustering from {0, 1} to the interval [0, 1], which lets each sample affect all clusters.
The following objective is minimized in FCM:

J_{FCM}(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d^{2}(x_j, v_i)

where n is the total number of samples, c is the number of clusters, x_j is the j-th data vector, v_i is the i-th cluster center, d(x_j, v_i) is the distance from x_j to v_i, u_{ij} is the membership of x_j in cluster i, and m is a default value of FCM, usually 1.5 or 2. However, FCM has several obvious drawbacks, such as sensitivity to noise and the need to predetermine the number of cluster centers [10]. The most severe problem is that FCM cannot find unknown attacks, because every data point must belong to one of the clusters obtained from normal traffic [11]. In view of the above problems, this paper uses possibilistic clustering instead. By relaxing the constraint that the memberships of each sample must sum to one, we obtain a possibility value for every sample. The objective function is then defined as:

J_{PCM}(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d^{2}(x_j, v_i) + \sum_{i=1}^{c} \eta_i \sum_{j=1}^{n} (1 - u_{ij})^{m}

where U is the possibility partition matrix and u_{ij} is the possibility that x_j belongs to v_i. To avoid the trivial solution U = 0, the term \sum_{i=1}^{c} \eta_i \sum_{j=1}^{n} (1 - u_{ij})^{m} is added as a penalty. \eta_i is a pre-determined value that directly influences the clustering result, usually estimated as

\eta_i = K \, \frac{\sum_{j=1}^{n} u_{ij}^{m} \, d^{2}(x_j, v_i)}{\sum_{j=1}^{n} u_{ij}^{m}}

where K > 0 and usually K = 1.
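To make the relationship between the two objectives concrete, the following is a minimal sketch that evaluates both, assuming Euclidean distance and the standard symbols above; the function names are illustrative, not from the paper.

```python
import numpy as np

def fcm_objective(X, V, U, m=2.0):
    """FCM objective: sum_i sum_j u_ij^m * d^2(x_j, v_i)."""
    # d2[i, j] = squared Euclidean distance from center v_i to sample x_j
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float((U ** m * d2).sum())

def pcm_objective(X, V, U, eta, m=2.0):
    """PCM objective: the FCM term plus the penalty sum_i eta_i * sum_j (1-u_ij)^m."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    penalty = float((eta[:, None] * (1.0 - U) ** m).sum())
    return float((U ** m * d2).sum()) + penalty
```

The penalty term is what lets a sample hold low possibility in every cluster at once, instead of being forced to distribute a total membership of one.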

Anomaly detection based on IPC
In this section, we introduce the whole process of detecting abnormal network traffic. The section first introduces damped incremental statistics, which reduce the memory overhead of feature extraction. Then the unsupervised possibilistic clustering method based on the Mahalanobis distance is introduced. Finally, an incremental possibilistic clustering method built on this approach is proposed to detect abnormal data.

Damped incremental statistics
Feature extraction is the first key problem in a pattern recognition system. Generally, the behavior of a network flow can be summarized by five features: source IP, destination IP, packet size, total number of packets and the time interval between packets. There are three problems with extracting these features. Firstly, the packet arrival rate can be very high; secondly, there may be many different conversations at the same time; thirdly, packets from different channels may be closely related [12,13]. An ordinary method is to maintain a window of packets for each conversation, but this uses a lot of memory and cannot be deployed on a router.
Therefore, we use the Damped Incremental Statistics framework proposed by Mirsky et al. This framework maintains incremental statistics [14] over a damped window, giving it O(1) complexity per packet.
Suppose S = \{x_1, x_2, \cdots, x_n\} is an unbounded stream, where x is an attribute of a packet (e.g. the packet size). The mean, variance and standard deviation of S can be updated incrementally by maintaining the tuple I_S = (N, LS, SS), where N, LS and SS are the count, linear sum and squared sum of the data stream. When a new value x arrives, I_S is updated to (N + 1, LS + x, SS + x^2).
In order to capture the current status of a data stream, old data must be discarded. The naive approach is to maintain a sliding window, whereas damped incremental statistics applies a decay function:

d_{\lambda}(t) = 2^{-\lambda t}

where \lambda > 0 is a default value and t is the time elapsed since the last observation of the stream. The window therefore adds a variable t_{last}, the time at which the last packet was caught, so the damped tuple becomes I_{Sd} = (N, LS, SS, t_{last}). When a new value x arrives at time t_{now}, the update is [5]:

I_{Sd} \leftarrow \big( d_{\lambda}(\Delta t) N + 1,\; d_{\lambda}(\Delta t) LS + x,\; d_{\lambda}(\Delta t) SS + x^2,\; t_{now} \big), \quad \Delta t = t_{now} - t_{last}.

For variables that require two-dimensional statistics (for both incoming and outgoing packets), the decayed sums of both streams and their decayed product sum are maintained as well. The statistics that can be computed from the damped incremental framework are shown in Table 1. A total of 23 features can be extracted from a single time window (see Table 2). In this paper we extract the same set of features over five time windows: 0.1 s, 0.5 s, 1.5 s, 10 s and 1 min (\lambda = 5, 3, 1, 0.1, 0.01), for a total of 115 features.
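The damped update above can be sketched as a small class; this is a minimal illustration of the decayed tuple (N, LS, SS, t_last), with the class and method names chosen for this sketch rather than taken from the framework.

```python
import math

class DampedIncStat:
    """Damped incremental statistics for one stream: O(1) memory.
    Maintains the decayed count w, linear sum LS, squared sum SS,
    and the timestamp of the last observation (t_last)."""

    def __init__(self, lam):
        self.lam = lam        # decay rate lambda of the window
        self.w = 0.0          # decayed count N
        self.LS = 0.0         # decayed linear sum
        self.SS = 0.0         # decayed squared sum
        self.t_last = None

    def update(self, x, t):
        # decay factor 2^(-lambda * dt) applied to the old statistics
        d = 1.0 if self.t_last is None else 2.0 ** (-self.lam * (t - self.t_last))
        self.w = d * self.w + 1.0
        self.LS = d * self.LS + x
        self.SS = d * self.SS + x * x
        self.t_last = t

    def mean(self):
        return self.LS / self.w

    def var(self):
        return self.SS / self.w - self.mean() ** 2

    def std(self):
        return math.sqrt(max(self.var(), 0.0))
```

Running one such object per stream and per \lambda yields the five time windows used in this paper without buffering any packets.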

Incremental possibilistic clustering algorithm based on Mahalanobis distance
Under normal circumstances, the value of each attribute stays within a bounded range. When an exception, usually a network attack, occurs, the values of the traffic features surpass their threshold ranges. The problem is that, although the features above improve recognition performance, the clusters of normal traffic are not centralized, and the shape of the data may be an ultra-high-dimensional irregular polyhedron. We therefore use incremental possibilistic clustering to obtain the clusters of normal packets, and then determine whether other packets are normal traffic. If a packet does not fit any cluster extracted from normal traffic, an anomaly is considered to have occurred.

Mahalanobis distance.
The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936. It is the distance of the test point from the center of mass, divided by the width of the ellipsoid in the direction of the point. Compared with the Euclidean distance, it accounts for the correlation between dimensions and is not affected by the scale of the dataset [15]. The 115 features extracted from packets are obviously highly correlated, so using the Mahalanobis distance makes the clusters more realistic and reliable. This paper needs to compute the distance between each data point and each cluster. Since the shape of the clusters and of the data can be considered the same, the distance function based on the Mahalanobis distance is defined as:

d^{2}(x_j, v_i) = (x_j - v_i)^{T} \Sigma^{-1} (x_j - v_i)

where x_j is a point in the dataset X = \{x_1, x_2, \cdots, x_n\}, v_i is one of the cluster centers, and \Sigma is the covariance matrix of X. Since the covariance matrix may be singular, so that the Mahalanobis distance cannot be calculated via the inverse matrix for these datasets, we use the pseudo-inverse matrix, computed by matrix factorization, instead.
where \beta is derived from the covariance of the dataset [16]. In PCM based on the Mahalanobis distance, it is recommended to adjust \beta to \beta = \sum_{i=1}^{n} d(x_i, \bar{x})/n, the average distance from each sample to the dataset mean \bar{x}. The clustering result then accords better with the partition indicators.
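A minimal sketch of the pseudo-inverse Mahalanobis distance follows; the function name is illustrative, and `numpy.linalg.pinv` (SVD-based Moore-Penrose pseudo-inverse) stands in for the matrix factorization mentioned above.

```python
import numpy as np

def mahalanobis_sq(X, v, cov=None):
    """Squared Mahalanobis distance from each row of X to center v.
    Uses the Moore-Penrose pseudo-inverse so a singular covariance
    matrix (common with 115 highly correlated features) is handled."""
    if cov is None:
        cov = np.cov(X, rowvar=False)
    cov_pinv = np.linalg.pinv(cov)   # pseudo-inverse via SVD
    diff = X - v
    # row-wise quadratic form: diff_k^T * cov_pinv * diff_k
    return np.einsum('ij,jk,ik->i', diff, cov_pinv, diff)
```

With an identity covariance the result reduces to the squared Euclidean distance, which is a convenient sanity check.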
Finally, the objective function of possibilistic clustering is defined as:

J(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d^{2}(x_j, v_i) + \sum_{i=1}^{c} \eta_i \sum_{j=1}^{n} (1 - u_{ij})^{m} \quad (5)

In order to minimize the objective function, the update formulas are:

u_{ij} = \frac{1}{1 + \left( d^{2}(x_j, v_i) / \eta_i \right)^{1/(m-1)}} \quad (6)

v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \quad (7)

In (6), u_{ij} is determined by the distance between the cluster center and the sample, and ranges from 0 to 1. This calculation relaxes the restriction that the membership degrees sum to 1, which means each cluster is computed independently. As a consequence, the true minimum of the objective function is reached if and only if all cluster centers coincide. Compared with FCM, PCM therefore needs appropriate initial values, or the result will be very unsatisfactory.
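One iteration of the updates (6) and (7) can be sketched as follows; this is an illustrative implementation under the assumption that distances are squared Mahalanobis distances computed with a pseudo-inverse covariance (identity gives the Euclidean case), and the function name is not from the paper.

```python
import numpy as np

def pcm_update(X, V, eta, m=1.5, cov_pinv=None):
    """One PCM iteration: possibilistic memberships (6), then centers (7).
    X: (n, p) samples; V: (c, p) centers; eta: (c,) scale parameters."""
    n, p = X.shape
    if cov_pinv is None:
        cov_pinv = np.eye(p)                     # Euclidean special case
    diff = V[:, None, :] - X[None, :, :]         # (c, n, p)
    d2 = np.einsum('cnp,pq,cnq->cn', diff, cov_pinv, diff)
    # (6): u_ij = 1 / (1 + (d2_ij / eta_i)^(1/(m-1)))
    U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))
    # (7): each center is the u^m-weighted mean of the samples
    W = U ** m
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V_new
```

Note that each row of U is computed independently of the others, which is exactly the relaxation that makes coincident centers the global minimizer and initialization critical.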

Increment possibilistic clustering
In PCM, to obtain a correct clustering of the dataset, the number of cluster centers must exceed the true number, and the initial cluster centers must be sufficiently dispersed. Generally this is regarded as a defect of PCM. But for network traffic, whose cluster number, shape and size are unpredictable, we design an Incremental Possibilistic Clustering algorithm (IPC) that obtains correct classifications by running PCM with an excess number of initial cluster centers, which makes the clustering accurate.

IPC principle.
Suppose there is a training set X = \{x_1, x_2, \cdots, x_n\} and cluster centers V = \{v_1, v_2, \cdots, v_c\}. The algorithm calculates the membership matrix U and sets a threshold u_min. When the membership of a data point to any cluster center is higher than the threshold, the point is marked as a member of that cluster. If the point cannot be classified into any category, it is marked as an outlier. This paper uses normal flow to obtain the clusters, and uses these clusters with the threshold to detect abnormal packets.
In order to find the cluster centers and the threshold of normal packets, IPC uses iterative PCM. Firstly, FCM is used to obtain the initial cluster centers V = \{v_1, v_2, \cdots, v_c\} with an initial number c. These cluster centers are then passed to PCM. The extra centers from FCM will merge together and some outliers will appear. These outliers may be true outliers of the dataset, or points of undetected aggregates. Therefore, IPC adds these outliers to the set of cluster centers to obtain a new set V_new = \{v_1, \cdots, v_c, x_{i_1}, \cdots, x_{i_m}\} and updates the cluster number from c to c + m. Finally, the above steps are repeated several times: the missing cluster centers are recovered, and fake centers created from true outliers move toward the true centers.

Realization of the algorithm
Suppose there is training feature data X = \{x_1, x_2, \cdots, x_n\} extracted from normal network traffic. The steps for training are as follows.
a) Set the initial values.
b) Use FCM to provide initial cluster centers.
c) Use PCM to cluster again and obtain the membership limit u_min:
   i) Calculate the distance between each sample and each cluster center.
   ii) Update the membership matrix using (6).
   iii) Update the cluster centers V using (7).
   iv) Repeat steps i) to iii) until the step limit is reached or V stabilizes.
d) Calculate the maximum membership degree u_max of each data point.
e) Choose the lowest 3% of u_max as u_min, or set a default u_min.
f) Add 1/10 of the data points whose u_max is less than u_min to the cluster centers V.
g) Run PCM again.
h) Repeat the previous steps several times until the step limit is reached or every outlier has been added to V.
i) Merge the cluster centers to get the final detection model.
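The steps above can be sketched as a training loop; this is a simplified illustration under the assumptions that `fcm_fit` and `pcm_fit` are the FCM and Mahalanobis-PCM routines defined elsewhere (both returning a membership matrix and centers), and the final center-merging step i) is omitted for brevity.

```python
import numpy as np

def ipc_train(X, c0=5, outlier_q=0.03, grow_frac=0.1,
              max_rounds=10, pcm_fit=None, fcm_fit=None, rng=None):
    """Sketch of the IPC training loop, steps a)-h).
    Returns the grown center set and the outlier threshold u_min."""
    rng = np.random.default_rng(rng)
    _, V = fcm_fit(X, c0)                        # b) initial centers from FCM
    u_min = 0.0
    for _ in range(max_rounds):                  # h) repeat until stable or limit
        U, V = pcm_fit(X, V)                     # c) one round of PCM
        u_max = U.max(axis=0)                    # d) best membership per sample
        u_min = np.quantile(u_max, outlier_q)    # e) lowest 3% defines threshold
        outliers = np.flatnonzero(u_max < u_min)
        if outliers.size == 0:                   # every point is covered
            break
        # f) promote 1/10 of the outliers to new cluster centers
        k = max(1, int(grow_frac * outliers.size))
        picked = rng.choice(outliers, size=k, replace=False)
        V = np.vstack([V, X[picked]])            # g) next round re-runs PCM
    return V, u_min
```

At detection time, a packet whose best membership against the trained centers falls below u_min is flagged as abnormal.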

Experiments and result Analysis
This paper uses three datasets to compare with the FCM algorithm. The first is the Iris dataset, used for qualitative analysis of the influence of the initial cluster centers. The second is a self-made dataset, used to analyze the ability to find clusters of arbitrary shape. Finally, data from real network flows is used to test the effect on anomaly detection.

Qualitative analysis
The first experiment used the Iris dataset. The parameter m was set to 1.5 and the number of initial clusters was the independent variable. Figure 1 shows that when the number of initial cluster centers equals the real cluster number, both FCM and IPC perform well. With an increasing number of initial clusters, Figure 2 shows that wrong cluster centers appear when using FCM, while IPC still gives a reliable result. The second experiment used a self-made two-dimensional dataset in which the data points are randomly distributed on a ring (r ∈ [40, 50]). The number of initial cluster centers was set to 10 and m = 1.5. After clustering (Figure 3), the cluster centers from FCM (red circles) do not fit the real data. IPC detects 98 cluster centers, and these centers reflect the true shape of the data.
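One way to reproduce the self-made ring dataset is sketched below; the paper does not state the exact sampling scheme, so uniform angle and uniform radius in [40, 50] are assumptions of this sketch.

```python
import numpy as np

def ring_dataset(n=500, r_min=40.0, r_max=50.0, seed=0):
    """Generate 2-D points randomly distributed on a ring:
    angle uniform in [0, 2*pi), radius uniform in [r_min, r_max]."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = rng.uniform(r_min, r_max, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])
```

A single centroid (as FCM finds with few centers) lands near the origin, far from every point on the ring, which is why many small centers tracing the ring describe this dataset better.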

IPC application to real datasets
The experiment used the ISCX-Botnet 2014 dataset created by Beigi et al. [17]. The dataset is composed of network flows from honeypots or from computers infected with a given bot binary in a controlled environment, and includes both abnormal and normal packets.
Firstly, we used damped incremental statistics to extract the attributes of normal network flows. 30,000 consecutive normal packets were used as the training dataset, and 10,000 packets comprising 47 normal packets and 9,953 abnormal packets were used as the testing dataset. This paper used MATLAB to run IPC. The initial number of cluster centers was set to 5, the parameter m was set to 1.5, and β was set to 180. The ceiling of the total number of steps was set to 100,000, and each round of PCM could not exceed 1/10 of the total steps. FCM and PCM stop when the movement of each center is less than 1 × 10^{-4}, and the whole IPC stops when all outliers have been chosen as cluster centers or the step limit is reached. The threshold to identify outliers was set to the 3% lowest memberships after the first round of PCM. After training, IPC finally detected 141 cluster centers, while FCM found only 5 overlapping cluster centers; the threshold to detect anomalous traffic was u_min = 1.9334 × 10^{-5}. In order to evaluate the effect of IPC on anomaly detection, we calculated the precision and detection rate on the test set. The results are shown in Table 3.
From the results, we can see that the model detects 98.8% of abnormal packets, while the error rate reaches 1%. The algorithm has low precision when detecting abnormal packets in mixed network traffic.

Conclusion
This paper introduces an anomaly detection method: incremental possibilistic clustering. The method extracts multi-dimensional features of network flows through damped incremental statistics, which reduces memory overhead. The Mahalanobis distance is then used to deal with the problems of high dimensionality. The core algorithm reduces the constraints on parameters and makes it possible to handle high-dimensional data.