Unsupervised Anomaly Detection via DBSCAN for KPIs Jitters in Network Managements

Abstract: For many Internet companies, a huge number of KPIs (e.g., server CPU usage, network usage, business monitoring data) are generated every day. Closely monitoring these KPIs and quickly and accurately detecting anomalies in such large volumes of data, so that faults can be located and business recovered, is a great challenge, especially for unlabeled data. KPIs can be checked by supervised learning when labeled data are available, but most KPIs are unlabeled, and labeling anomalies is time-consuming and laborious work for company engineers. An unsupervised model that can detect anomalies in unlabeled data is therefore urgently needed. In this paper, the unsupervised DBSCAN algorithm is combined with feature extraction from the data; for some KPIs, the best F-score reaches about 0.9, which makes the approach a practical answer to this problem.


Introduction
To ensure continuous operation of their business and to avoid exceptions in their networks [Li, Cai and Xu (2018); Luo, Wang, Cai et al. (2019); Erfani, Rajesegarar, Karunasekera et al. (2016)], many Internet companies run large operations and maintenance departments to monitor the KPI (key performance indicator) data of their systems. KPIs are time series that measure metrics such as page views, number of online users, and number of orders. Companies also use automatic monitoring tools, such as Zabbix [Olups (2004)], to save human and material resources. Current KPI monitoring tools are still based on thresholds, and the threshold method depends closely on the experience of engineers: an operations expert with long experience of the business manually summarizes repeated, traceable phenomena into rules and sets new thresholds. For new services or periodic data, thresholds rarely meet actual needs, and many anomalies cannot be monitored effectively. KPIs (as shown in Fig. 1) are important indicators of how a monitored system is operating [Kang, Zhao, Li et al. (2016)]. KPI anomaly detection is a key technology for the intelligent operation and maintenance of Internet services, and most other key technologies in that area depend on KPIs. When a KPI behaves abnormally (e.g., a sudden increase, a sudden drop, or jitters), some potential failure has often occurred in the related applications, e.g., network failure, network overload, server overload, or an external attack.

Figure 1: Different trends of KPIs
In academia, many methods have been proposed for KPI anomaly detection [An and Cho (2015); Park, Hoshi and Kemp (2018); Park, Erickson, Bhattacharjee et al. (2016); Bodin, Malik, Ek et al. (2017); Laptev, Amizadeh and Flint (2016)]. Intelligent anomaly detection relies on machine learning algorithms that learn automatically from massive KPI data (including events and the manual processing logs of operators) and continually refine and summarize rules. However, the machine learning algorithms used to detect system anomalies depend heavily on labels, and it is unrealistic to label huge quantities of KPIs manually [Chandola, Banerjee and Kumar (2009)]. In addition, supervised methods cannot effectively detect new anomalies, because such anomalies cannot be labeled in time. In this paper, we propose an unsupervised learning method that effectively monitors KPIs and detects short-term jitters in a persistent KPI, i.e., short stretches that differ from the KPI's normal behavior. The main contributions of this paper can be summarized as follows:
1. The detection algorithm depends only weakly on labels, and it can detect abnormal data even in the extreme case where no labels are available.
2. It can effectively detect new anomalies. Supervised models detect the kinds of anomalies present in the training samples but are insensitive to new ones, whereas unsupervised models can detect new anomalies effectively.
3. The method is general: it can detect anomalies in different KPIs.
2 Related works
Traditional statistical models. In industry, anomaly detection methods based on traditional statistical models are widely used to detect abnormal situations in systems and web applications. However, these models depend heavily on experts to pick a suitable detector for a given KPI. With a large number of KPIs following different trends, selecting a suitable detector for each KPI is not feasible, and choosing among multiple detectors consumes considerable effort and company resources. These detectors are therefore unsatisfactory in practice.
Supervised ensemble approaches. To circumvent the hassle of algorithm and parameter tuning for traditional statistical anomaly detectors, supervised ensemble approaches [Fontugne, Borgnat, Abry et al. (2010)], EGADS [Laptev, Amizadeh and Flint (2015)] and Opprentice [Liu, Zhao, Xu et al. (2015)] were proposed. They train anomaly classifiers using user feedback as labels and the anomaly scores output by traditional detectors as features. Both EGADS and Opprentice showed promising results, but they rely heavily on good labels (far more than the anecdotal labels accumulated in our context), which is generally not feasible in large-scale applications. Furthermore, running multiple traditional detectors to extract features during detection introduces substantial computational cost, which is a practical concern.
Unsupervised approaches and deep generative models. Recently, there has been a rising trend of adopting unsupervised machine learning algorithms for anomaly detection, e.g., one-class SVM [Sarah, Sutharshan, Shanika et al. (2016)], clustering-based methods [Fu, Hu and Tan (2005)] such as K-Means [Münz, Li and Carle (2007)], VAE [Xu, Chen, Zhao et al. (2018)], and DBSCAN [Harisinghaney, Dixit, Gupta et al. (2014)].
The main idea is to model normal data rather than abnormal data: because a KPI consists mainly of normal data, models can be readily trained even without labels. Roughly speaking, these models first identify the normal region of a KPI and then distinguish anomalies by measuring their distance from that region. Along this direction, we are interested in the DBSCAN model for the following reasons. First, KPIs are unlabeled, and we aim to detect anomalies according to the normal pattern of each KPI; learning the normal pattern can be seen as learning the distribution of the training data. Second, DBSCAN is widely used for clustering dense data sets of arbitrary shape. Third, on some KPIs, DBSCAN achieves very good anomaly detection results, with an F-score significantly higher than that of other models (as shown in §4). Fourth, simply adopting much more complex models based on VRNN [Sölch, Bayer, Ludersdorfer et al. (2016)] led to long training times and poor performance in our experiments. Fifth, unlike k-means, DBSCAN is not affected by abnormal samples and can detect them automatically.

Problems and solutions
In this section, we first describe some of the problems that exist in the field of anomaly detection and then introduce the methods we use to address them. Finally, based on these methods, we propose an effective unsupervised detection method.

Problems
In summary, existing anomaly detection algorithms suffer from problems such as algorithm selection and parameter tuning, heavy dependence on labels, poor performance, and/or lack of a theoretical basis. Existing methods are either supervised or unsupervised, and the detection performance of the unsupervised models is unsatisfactory in some situations. In this article, we extract static features from a data window and apply an unsupervised model, which monitors anomalies effectively without relying on labels and achieves better results. The problem addressed in this article is stated as follows: we aim to use an unsupervised anomaly detection algorithm to detect anomalies in KPIs with few or no labels, and to do so with satisfactory results. Because DBSCAN performs well among unsupervised methods, we chose it as the starting point of our work.

Solutions
3.2.1 Data pre-processing
Windowed data. Anomaly detection on KPIs must be timely: only anomalies detected within a bounded period of time are useful in actual production. Therefore, a window transformation is used in this paper. By sliding a window over the KPI data from beginning to end, the time series is transformed into a sequence of windows [Sun, Ge, Huang et al. (2019)]. A window is flagged as abnormal when it contains an exception. With an appropriate window size, exceptions can be reported within a limited time.
Extract KPI's static feature. The model uses static (statistical) characteristics of the data for prediction. Statistical characteristics of each window, such as variance and mean, are extracted. The features of a window then represent each point and serve as input to the DBSCAN algorithm. In most cases, outliers differ from normal points (e.g., a sudden increase or a sudden decrease), so their combined features lie far from those of most normal points and the outliers can be detected. The outlier samples in the algorithm's output correspond to the outliers in the KPI.
(2007)] DBSCAN algorithm is a density-based unsupervised clustering algorithm and is very efficient and effective [Januzaj, Kriegel and Pfeifle (2003)]. The algorithm groups closely connected samples into one class, thus forming different categories. Its advantage is that it is insensitive to outliers and can find outliers among the samples. Unlike K-means clustering [Kanungo, Mount, Netanyahu et al. (2002)], which requires the number of classes to be set in advance, DBSCAN classifies samples into categories automatically, controlled by the distance parameter ɛ and the neighborhood sample-count threshold MinPts. The DBSCAN algorithm is shown as follows:

Input:
  D: a data set containing n objects
  ɛ: the radius parameter
  MinPts: the neighborhood density threshold
Output: a set of density-based clusters
Algorithm start
  mark all objects as unvisited
  while there are objects that are not visited do
    randomly select an unvisited object p
    mark p as visited
    if the ɛ-neighborhood of p has at least MinPts objects then
      create a new cluster C, and add p to C
      let N be the set of objects in the ɛ-neighborhood of p
      for each point p' in N do
        if p' is unvisited then
          mark p' as visited
          if the ɛ-neighborhood of p' has at least
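The listing above breaks off partway through the cluster-expansion loop. As a rough illustration of how the standard DBSCAN procedure continues, here is a minimal pure-Python sketch. It is not the authors' implementation. It rests on these assumptions: points are tuples compared by Euclidean distance, a point counts as part of its own ɛ-neighborhood, objects are visited in index order rather than at random, and the names `dbscan` and `NOISE` are hypothetical.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # Brute-force eps-neighborhood of point i (includes i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)  # None means "unvisited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # provisional: may later become a border point
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep expanding
    return labels


points = [(0.0,), (0.01,), (0.02,), (0.03,), (1.0,)]
labels = dbscan(points, eps=0.05, min_pts=3)
print(labels)
```

In this toy input the four nearby values form one cluster and the isolated value is labeled as noise, which is the role outliers play in the detection method described in this section.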

Data process and feature extraction
The data sets in our experiment come from those provided in Xu et al. [Xu, Chen and Zhao (2018)] and from those provided by the anomaly detection contest. These data were generated by the actual monitoring systems of Internet companies, so detecting anomalies in them is of practical significance. All KPIs have an interval of 1 minute or 5 minutes between two observations. Each KPI record has three fields: time stamp, value, and label, where the label gives the state at each moment (0 means abnormal, 1 means normal). Labels were tagged by engineers who maintain the normal operation of the machines and have rich experience in their business, which helps us evaluate the results in the test phase. In this paper, we choose 3 data sets, denoted A, B, and C, so that we can evaluate the methodology at different noise levels. After testing, we set the window size to 5 and fill missing data points directly with 0. For each window, we extract only a combination of variance (denoted var) and difference (denoted diff): we calculate the variance var of the window and the first-order difference diff of each point, and then form the combined feature var*diff*window size. Each point is thus converted into a combined feature that serves as input to the algorithm, and the transformed KPI sequences can be used directly in our model. After the transformation, the features of outliers differ clearly from those of normal points, which helps the clustering algorithm distinguish outliers in the next step. This is discussed in detail in the fifth chapter.
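The transformation described above can be sketched as follows. The text does not specify how the window is aligned, which variance estimator is used, or how the first point is handled. This sketch therefore assumes a trailing window that ends at the current point, the population variance, and a difference of 0 for the first point. The function name `combined_features` is hypothetical, and missing values are represented as `None` purely for illustration.

```python
from statistics import pvariance


def combined_features(values, window_size=5):
    """Map each point to var * diff * window_size (one feature per point)."""
    # Fill missing observations with 0, as described in the text.
    filled = [0.0 if v is None else float(v) for v in values]
    features = []
    for t in range(len(filled)):
        # Trailing window ending at t (assumption); shorter near the start.
        window = filled[max(0, t - window_size + 1): t + 1]
        var = pvariance(window)  # population variance of the window
        # First-order difference; taken as 0 for the first point (assumption).
        diff = filled[t] - filled[t - 1] if t > 0 else 0.0
        features.append(var * diff * window_size)
    return features


kpi = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]
feats = combined_features(kpi)
for value, feat in zip(kpi, feats):
    print(f"value={value:4.1f}  feature={feat:8.2f}")
```

On this toy series, flat stretches map to a feature of 0, while the spike and the drop back produce large features of opposite sign. This is the separation between jitters and normal points that the clustering step relies on.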

Algorithm application and result analysis
The features extracted in the data pre-processing step can be used directly as input to the DBSCAN algorithm. Clustering yields different clusters; the anomaly samples are grouped into one cluster, which is taken as the result of anomaly detection. After tuning the parameters ɛ and MinPts, we obtain good results with a certain generalization ability at ɛ=0.05 and MinPts=20. We use the same evaluation index, F-score, as Xu et al. [Xu, Chen, Zhao et al. (2018)]. The F-scores on A (one KPI), B (CPU4), and C (server) are 0.971, 0.932, and 0.937, respectively. The algorithm thus performs very well on the three data sets. More generally, across many experiments, we find that this algorithm performs almost the same as the VAE-based algorithm [Xu, Chen, Zhao et al. (2018)] on some data sets and even better on others. Comparing Fig. 2 and Fig. 3, we can see that the algorithm successfully detects most of the outliers. According to the evaluation method in that paper, as long as the first anomaly point of a continuous anomaly interval is detected within a small delay, all anomaly points in the interval are considered successfully detected.
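The interval-based evaluation rule can be illustrated with a short sketch. Several details are assumptions rather than statements from the text: `True` marks an anomalous point (the data files described earlier encode labels differently), the allowed delay is counted in points from the start of each interval, and an interval with no detection inside the allowed delay is counted as entirely missed. The function names and the `delay` parameter are hypothetical.

```python
def adjust_predictions(truth, pred, delay=1):
    """Credit or discredit whole anomaly intervals based on early detection."""
    adjusted = list(pred)
    n = len(truth)
    i = 0
    while i < n:
        if not truth[i]:
            i += 1
            continue
        j = i
        while j < n and truth[j]:
            j += 1  # the anomaly interval is [i, j)
        # Detected if any alarm falls within `delay` points of the interval start.
        hit = any(pred[i:min(j, i + delay + 1)])
        for k in range(i, j):
            adjusted[k] = hit
        i = j
    return adjusted


def f_score(truth, pred):
    """F1 score with True as the positive (anomalous) class."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


T, F = True, False
truth = [F, F, T, T, T, F, F, T, T, F]
pred = [F, F, F, T, F, F, T, F, F, F]
adjusted = adjust_predictions(truth, pred, delay=1)
score = f_score(truth, adjusted)
print(adjusted, round(score, 3))
```

In this example the first interval is detected one point late and is fully credited, the second interval is missed, and the isolated alarm outside any interval remains a false positive.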

Impact of ε and MinPts
The parameters ɛ and MinPts play an important role. ɛ is the maximum distance between two samples for them to be considered in the same neighborhood, and MinPts is the number of samples (or total weight) in a neighborhood for a point to be considered a core point; together they determine the effect of clustering. Values that are too large or too small are likely to give bad results. In Fig. 4 and Fig. 5, we present the F-score for different values of ɛ and MinPts on data sets A, B, and C. It can be seen from Fig. 4 and Fig. 5
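To give a feel for how the two parameters interact, the sketch below counts DBSCAN noise points on a small synthetic one-dimensional feature list for a grid of ɛ and MinPts values. It uses the fact that a noise point is one that is neither a core point nor within ɛ of any core point, so no cluster expansion is needed. The data, the grid, and the name `count_noise` are made up for illustration and are unrelated to data sets A, B, and C.

```python
def count_noise(values, eps, min_pts):
    """Count DBSCAN noise points in 1-D data by brute force."""
    n = len(values)
    # eps-neighborhood of every point (each point is its own neighbor).
    neighborhoods = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighborhoods]
    # Noise: not a core point and not within eps of any core point.
    return sum(
        1
        for i in range(n)
        if not core[i] and not any(core[j] for j in neighborhoods[i])
    )


# Synthetic features: a dense normal region, a small group, and one extreme value.
features = [0.0] * 30 + [0.02] * 10 + [0.3, 0.32, 5.0]

results = {}
for eps in (0.01, 0.05, 0.5):
    for min_pts in (2, 20, 40):
        results[(eps, min_pts)] = count_noise(features, eps, min_pts)
        print(f"eps={eps:<5} MinPts={min_pts:<3} noise points={results[(eps, min_pts)]}")
```

On this toy data, a very small ɛ combined with a large MinPts labels every point as noise, while a very large ɛ leaves only the most extreme value flagged. Intermediate settings separate the small group from the dense region.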

Analysis and discussion
Why do the methods in this paper work surprisingly well on many KPIs, such as data sets A, B, and C? In Fig. 6, we compare the original data with its features after transformation. After the transformation, the features of outliers differ clearly from those of normal points, which helps the clustering algorithm distinguish outliers, because clustering algorithms tend to group data with the same characteristics.

Figure 6: The original KPI sequence (above) and the transformed variance difference combination feature sequence (below). Red dots denote abnormal points
In fact, after extensive observation and experiments, we found that most exceptions in KPIs are due to jitters. A data jitter means that the data suddenly rises or falls. We believe these data points are marked as abnormal because such jitters deviate from most of the data, which suggests that something unusual happened at that moment.
Therefore, detecting such exceptions is necessary and useful for actual business analysis. When we extract variances and differences and form combined features, the jitters amplify the distance between abnormal and normal data, so clustering gives good results. There are some drawbacks to this approach [Khan, Rehman, Aziz et al. (2014)]. Many KPIs achieve good results, but not all do. Some jitters in the data are normal, such as periodic jitters, and should not be considered abnormal; because this method does not take the global characteristics of the data into account, it cannot distinguish normal jitters from abnormal ones. The extraction and combination of features may also accentuate normal data points so much that they are classified as exceptions. Finding a method that works for all KPIs is difficult. In addition, when there is a lot of data, the algorithm is very memory intensive because a KD-Tree is built during the optimization process. We leave these issues to future work.

Conclusion
In current research, supervised and unsupervised machine learning methods are widely used in various fields [Liu, Liu, Liu et al. (2019)]. This paper proposes an unsupervised anomaly detection algorithm for KPIs (time series data) that does not depend on labels. First, KPIs are transformed, by windowing the data and extracting specific features, into data that retains the original characteristics and is convenient for the model to use. Then, the unsupervised DBSCAN algorithm is used to detect jitters in the KPI and thereby detect anomalies. The approach tends to detect jitter anomalies (sudden increases and sudden decreases) and does not take global data characteristics into account. Consequently, it has shortcomings in periodic anomaly detection, which can be improved in future work. At the same time, we are also working on running the algorithm efficiently on edge terminal devices, so that system anomalies can be detected quickly [Liu, Guo, Cai et

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.