
1 Introduction

In recent years, research into new methods of anomaly detection within a network has grown in prominence. The need for a fast, reliable method of identifying possible malicious activity has grown in response to emerging threats. Protecting confidential and proprietary data is of paramount importance to any organization in order to meet both legal and contractual obligations. In addition, the data stored may not necessarily be the property of the company storing and handling it. Malicious activity such as botnets and port scans is increasing in frequency. These attacks, while simple, have the potential to allow unauthorized access to the network.

Within any organization, it is commonplace to use network monitoring and analysis tools to help detect anomalous behaviour on the network. Tools such as McAfee ePO and TippingPoint, for example, are signature-based: they require a known example of a threat to be catalogued and a signature generated. The signature-based model, while highly precise, fails when a novel attack occurs (e.g. a zero-day vulnerability), since no previous signature exists. This limitation gave rise to anomaly-based detection mechanisms. These methods require no signature database; instead, they model the “normal” traffic on a network and alert on any activity that falls outside these normality bounds.

While much research has been conducted on anomaly-based systems using a variety of approaches [1,2,3,4,5], key limitations apply when attempting to adapt these approaches to a real-time system, most notably computational cost. Within commonly used distance-based outlier detection mechanisms, the distance calculations required for each new sample can be overwhelming for high-volume data. This gave rise to optimized algorithms designed to mitigate this limitation. Algorithms such as Fast Local Outlier Factor (FastLOF) [6] and Micro-Cluster based Outlier Detection (MCOD) [7] reduce the overall cost of the range queries with varying degrees of success.

It has been shown that applying time-series analysis can be beneficial in the detection of network anomalies. Applying time series over time windows of increasing size has been shown to normalize typical behaviours over time. However, at smaller time intervals, abnormal behaviour may traverse time-window boundaries, allowing it to escape detection as a false negative. This is particularly prevalent among anomalies that generate low numbers of NetFlow records. Such activity can become hidden within other network traffic as time progresses, making detection increasingly unlikely.

Moreover, while a large increase may occur, small deviations in established traffic behaviour may also indicate unauthorized activity. For example, an increase in failed login attempts may produce little difference with respect to NetFlow volume, yet it may be indicative of someone trying to guess a password. Detecting such an instance would be of paramount importance, particularly if it is followed by a successful login attempt [8]. This paper proposes a solution to this problem. By implementing parallel clustering algorithms, it is possible to gain a higher level of granularity while retaining the normalization gained from an incrementally increasing time window. Concurrent algorithm instances can detect minor deviations from established behaviour regardless of whether they occur while traversing time-window boundaries.

In Sect. 2, an overview of related work is given. In Sect. 3, a brief overview of the technology used is presented. In Sect. 4, we propose a framework for the identification of anomalies within NetFlow data; the architecture is presented along with an overview of the algorithms created and used. In Sect. 5, the testing methodology is presented. In Sect. 6, results obtained from live data are presented and analysed. In Sect. 7, conclusions and future work are presented.

2 Related Works

Recent research into anomaly detection has largely focused on applying anomaly detection mechanisms to network data in order to identify anomalous behaviour. Many problems still exist, however. Performance is a key factor when trying to utilize anomaly detection techniques, and there are many examples where this is apparent. Methods such as principal component analysis [3], k-nearest neighbour [9] and ensemble techniques [10, 11] have been applied to this task with varying degrees of success.

However, their comparatively expensive operations have led to a rise in clustering techniques that mitigate the calculations needed when applying anomaly detection to big data [12, 13]. Aggregated data, such as NetFlow, can be used to reduce the calculations further [14]. Limitations are present with these techniques, however. Kumari et al. [15] examine a clustering technique for anomaly detection over a network, setting a distance-based threshold as the 100th farthest data point from the obtained cluster centroids. This kind of threshold is a common theme across multiple anomaly detection solutions [16,17,18]. However, it can be argued that using a common threshold over all clusters within real data is non-representative of the various forms of traffic created; for example, traffic from different applications does not behave in a similar manner.

This brings forward an interesting problem. While distance-based outliers have been shown to be of significance in a plethora of works [19,20,21], we propose another indicator of possible anomalous behaviour. By monitoring cluster density over a time series, changes in underlying behaviour can be detected. Rather than focusing only on samples that are anomalous according to distance-based calculations, changes in the density of activity over time are also monitored.

Asmuss et al. [18] demonstrate the use and effectiveness of a time-series based approach over live data. The aggregation of traffic is highly beneficial in this case, as it reduces the computational resources needed. Furthermore, it also provides a tangible benefit when comparing results across clusters. This idea of time-aware analysis has been used elsewhere as well [22], and has been shown to be a valuable tool in mapping continuous behaviour.

This ability to generate a normalized view of traffic over time has, practically speaking, some limitations. The potential for an anomaly to traverse through sequential time windows can lead to the anomaly threshold not being broken, thus producing a false negative reading. Presented in Sect. 4 is an architecture that utilizes concurrent time windows in order to mitigate the risk of this happening. Anomalies can be gathered from individual instances of the clustering algorithm, while the correlation of clustering behaviours across instances can also indicate anomalous activity.

3 Technology Overview

3.1 Cisco NetFlow

Cisco NetFlow is a system that amalgamates network traffic information into a format that describes the communications occurring on a network. Through a NetFlow enabled device (Fig. 1), packet traces are identified and stored as a single flow representation of a specific set of communications between two devices. These flows are used for multiple tasks, such as network performance monitoring, and as a means of security evaluation once an incident has been detected. Visualization of NetFlow has also proven to be of tangible benefit [23]. Using this aggregated data for anomaly detection has numerous benefits, such as a reduced data size for processing and storage. NetFlow was used for the analysis in this paper because NetFlow-based network monitoring was found to be highly common in the area [24] and, in this case, the current infrastructure of NPD allowed for the collection of NetFlow with relative ease.

Fig. 1. NetFlow example

3.2 MCOD

MCOD is a distance-based outlier detection mechanism that utilizes clustering to reduce the number of distance calculations needed to identify possible anomalies. The reduced computational cost of the algorithm makes it an ideal candidate for real-time anomaly detection over data streams when compared to other distance-based algorithms such as Local Outlier Factor (LOF). As described in [7], MCOD uses a sliding-window approach to identify outliers over a data stream. By using an expiring data set, the algorithm can be tuned to retain only a data set that is large enough for satisfactory anomalies to be detected, while keeping the number of calculations low. Distance is calculated between the cluster centroids and the NetFlow sample being queried. If the point is within this range of a cluster centroid and the cluster has the specified density, k, then the point is determined to be a non-anomaly, or inlier.
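
To make the inlier criterion concrete, the following minimal Python sketch illustrates the test described above; the radius R, the density threshold k and the plain Euclidean distance are illustrative assumptions rather than a reproduction of the exact MCOD implementation.

import math

def is_inlier(sample, centroids, densities, R, k):
    """Illustrative micro-cluster inlier test in the spirit of MCOD.

    A sample is treated as an inlier if it lies within distance R of a
    micro-cluster centroid whose density (member count) is at least k.
    """
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    for centroid, density in zip(centroids, densities):
        if euclidean(sample, centroid) <= R and density >= k:
            return True   # close to a sufficiently dense micro-cluster
    return False          # candidate outlier, pending further evidence in the window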

4 Proposed Framework

The architecture is a two-step approach that involves monitoring traffic at different levels of abstraction. Firstly, an adaptation of the MCOD algorithm is applied to the NetFlow data in sequential time windows (Fig. 2). This stage can outline distance-based outliers contained within the NetFlow information. Following this, the clusters generated within each time window are correlated to identify those representing similar traffic. The density values (how many input samples are contained within the cluster) are then gathered at the end of each 5-minute window period.

Fig. 2. Architecture overview

4.1 Anomaly Detection

When using algorithms such as k-nearest neighbour and MCOD, a common variable that must be tuned to the dataset is k. While its exact use may change, the principle remains consistent: k is used to describe the limit of normality within anomaly detection frameworks. It is a single variable used to tune the classification or detection rate within an algorithm. Within MCOD, it is the value at which a micro-cluster is generated and an outlier is classified as an inlier. Within the context of NetFlow anomaly detection, this is a severe limitation, as it assumes that all network traffic activity has an equal distribution across the network, which is an extremely unlikely assumption (discussed further in Sect. 5).

In order to mitigate this limitation, all cluster densities are monitored irrespective of the value of k. This allows abnormal increases and decreases in cluster densities to be detected with respect to the time window being analysed.

4.2 Density Normality Measurement Generation

Within the proposed architecture, MCODT is initialized with a time window of size t. MCODT clusters the data within this window, identifying anomalies using distance-based calculations. The clusters generated within this time window contain samples that correspond to a type of network behaviour (Table 5). In parallel, further windows are initialized in time steps of St using the same configuration as the initial algorithm (Fig. 3).

Fig. 3. Commonly used notation
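
The staggered initialization can be illustrated with the following hypothetical Python sketch, in which the one-minute step and five-minute window mirror the configuration used later in Sect. 5.2; the class and parameter names are our own and are not taken from the original implementation.

from dataclasses import dataclass, field

@dataclass
class MCODTInstance:
    """Hypothetical wrapper for one time-windowed MCOD instance."""
    start_offset_min: int           # offset of this instance's first window
    window_min: int = 5             # window size t (5 minutes in our tests)
    clusters: dict = field(default_factory=dict)  # persisted cluster id -> density history

def init_parallel_instances(n_instances=5, step_min=1, window_min=5):
    # Instance 0 starts at t = 0; each subsequent instance is delayed by step_min,
    # so the windows overlap and an event is never more than step_min away
    # from a window boundary.
    return [MCODTInstance(start_offset_min=i * step_min, window_min=window_min)
            for i in range(n_instances)]

instances = init_parallel_instances()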

The positions of the clusters generated are then correlated with each subsequent MCODT instance to capture the clusters' activity across the other MCODT instances. This correlative step is needed because, due to the cluster generation step of MCODT, a cluster might not be in exactly the same position throughout the sequential time windows, even though it represents the same class of network traffic (Fig. 4).

Fig. 4. MCODT definition
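
A simple way to perform this correlation is to match each newly generated cluster to the nearest previously persisted centroid, as in the sketch below; the tolerance eps and the nearest-centroid matching rule are assumptions made for illustration.

import math

def correlate_clusters(prev_centroids, new_centroids, eps=0.05):
    """Match each new cluster to the nearest previously persisted cluster.

    prev_centroids / new_centroids: dict of cluster_id -> centroid (tuple).
    Returns a mapping new_id -> persisted_id, or None when no persisted
    centroid is within eps, in which case a new persistent ID is created.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    mapping = {}
    for new_id, c_new in new_centroids.items():
        best_id, best_d = None, float("inf")
        for old_id, c_old in prev_centroids.items():
            d = dist(c_new, c_old)
            if d < best_d:
                best_id, best_d = old_id, d
        mapping[new_id] = best_id if best_d <= eps else None
    return mapping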

Once the clusters have been correlated and shown to be representative of the same class of network traffic, the cluster is persisted and given an ID. At the end of each instance of MCODT, the cluster's density is measured and compared to its own historical activity and its activity in the other instances (Fig. 5).

Fig. 5. Cluster correlation

This allows anomalies within the persisted cluster to be identified using the 3-standard-deviation rule. Furthermore, when a cluster is not generated in all the instances of MCOD, it is indicative of non-homogeneous network activity. This specific type of traffic (as shown in Sect. 6) is highly irregular, and corresponds to network traffic that is extremely uncommon within the testing environment (Fig. 6).

Fig. 6. Residual calculation
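
The density check can be sketched as follows, assuming the per-window densities of a persisted cluster are available as a simple list; the 3-standard-deviation test mirrors the rule stated above, while the handling of short histories is an assumption.

import statistics

def is_density_anomaly(density_history, current_density):
    """Flag a persisted cluster whose current window density deviates by more
    than three standard deviations from its historical mean (3-sigma rule)."""
    if len(density_history) < 2:
        return False   # not enough history to estimate the spread
    mean = statistics.mean(density_history)
    std = statistics.stdev(density_history)
    if std == 0:
        return current_density != mean
    return abs(current_density - mean) > 3 * std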

5 Methodology

5.1 Collected Data

To test the proposed method, NetFlow was collected over a 24-hour period from within NPD. This was live data, and no previous insight about this 24-hour period was held; it was unknown whether it contained anomalies or not, simulating real-world conditions. The NetFlow contained all communications, both internal and external, during this period, totalling 151,995,634 NetFlow samples. From these samples, 8 attributes were extracted (Table 1), and from these, 6 attributes were selected for use in the anomaly detection calculations.

Table 1. Selected NetFlow attributes
  • Source/Destination IP
  • Destination Port
  • Source Port
  • Destination Bytes
  • Source Bytes
  • Protocol ID

These attributes were normalized using theoretical maximums as well as observed maximums over a 3-month period (Table 2).

Table 2. Maximum values used for normalization
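
As an illustration, the normalization can be expressed as a simple division by the per-attribute maximum, as in the sketch below; the byte-count maxima shown are placeholders, since the actual observed values of Table 2 are not reproduced here.

# Illustrative max-based normalization; the real maxima come from Table 2
# (theoretical limits, e.g. 65535 for ports, and 3-month observed maxima for
# byte counts). The byte maxima below are placeholders, not the real values.
MAXIMA = {
    "dst_port": 65535,        # theoretical maximum for a TCP/UDP port
    "src_port": 65535,
    "protocol_id": 255,       # theoretical maximum for the IP protocol field
    "dst_bytes": 1_000_000,   # placeholder observed maximum
    "src_bytes": 1_000_000,   # placeholder observed maximum
}

def normalize(flow):
    """Scale each selected attribute into [0, 1] by its maximum, clipping at 1."""
    return {k: min(flow[k] / MAXIMA[k], 1.0) for k in MAXIMA}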

The remaining collected attributes were excluded from the anomaly calculations for various reasons. The IP addresses were excluded because the IP address leases allocated by the DHCP servers were inconsistent both in maintaining the allocated IPs and in the time-out periods for leased IPs. This would lead to inconsistent results within networks, as IP addresses could be re-allocated in as little as 30 minutes, drastically changing their perceived normal traffic pattern. Instead, IP addresses were categorized as either internal or external, in order to develop separate clusters in feature space representing internal-to-internal and external-to-external traffic types. These attributes, along with the Start Time attribute, were collected for the investigation.
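
The internal/external categorization can be sketched as a membership test against the organization's address ranges; the RFC 1918 private ranges used below are a stand-in assumption, as the real address plan of NPD is not reproduced here.

import ipaddress

# RFC 1918 private ranges used here as a stand-in for "internal"; a real
# deployment would use the organization's own address plan instead.
INTERNAL_NETS = [ipaddress.ip_network(n) for n in
                 ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def ip_category(addr):
    """Return 1 for internal addresses, 0 for external ones."""
    ip = ipaddress.ip_address(addr)
    return int(any(ip in net for net in INTERNAL_NETS))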

5.2 Program Configurations

For testing, a 24-hour example was chosen with no specific preference, and no previous assumptions existed about this data before testing. The architecture was configured with an initial time window size of 5 minutes, and parallel instances were configured to start at one-minute intervals after this, leading to a total of 5 MCODT instances processing the data in parallel. The configured variables are outlined in Table 3.

Table 3. Program configurations
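
For reference, the textual configuration can be summarized as in the sketch below; the distance radius and density threshold values are placeholders, since the actual values of Table 3 are not reproduced here.

# Configuration mirroring the textual description in Sect. 5.2; the distance
# radius R and density threshold k are placeholders, not the values of Table 3.
CONFIG = {
    "window_minutes": 5,     # initial MCODT time window size
    "step_minutes": 1,       # offset between parallel instances
    "n_instances": 5,        # total parallel MCODT instances
    "radius_R": 0.05,        # placeholder distance threshold
    "density_k": 30,         # placeholder micro-cluster density threshold
}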

6 Results

In this section, we discuss the results of the proposed method of anomaly detection. Anomalous samples that were identified at both stages of the proposed architecture are outlined and analysed. Examples of normal activity of various types will also be presented.

6.1 Point Anomalies

Due to the two-stage architecture of the proposed method, anomalies may be detected in two different manners. Firstly, anomalies outlined by the distance-based calculations are reported at the end of every time window. An anomaly outlined at this stage is a point that never meets the required density, irrespective of time. These are regarded as point anomalies. Due to the relatively small time window of the MCODT instances, the number of point anomalies detected within the first hour of processing was vast; the number of additional anomalies fell rapidly over the course of the analysis. Because of this, focus was placed on point anomalies that were detected after the initial 12 hours of analysis. Table 4 outlines two such samples that were correctly identified as anomalies.

Table 4. Sample point anomalies

Sample A represents a simple DNS request, which at first seemed like a false positive. However, upon investigation, it was shown that this DNS request was from an internal asset to an external DNS server. This incident was of interest to security technicians within the NPD Group. Sample B was an unauthorized UPnP (Universal Plug and Play) device connected to the network. It has been well documented how network security can be affected by having a UPnP device hosted on a network [25]. Upon detection, the device was disconnected from the network.

6.2 Cluster Density Analysis

Cluster densities were measured at regular intervals. Five instances of MCODT were run in parallel, each with the same configuration settings (Table 3). The initial instance, MCODTN0, was initialized at 00:00 on the day in question. One minute after this, MCODTN1 was initialized, followed by MCODTN2 and so on. This low-level analysis allows for the detection of possible malicious activity in as little as one minute after an incident. Table 5 outlines sample clusters, selected on the basis of being classified as anomalies, and their activity over the course of one hour within the scope of all the independent instances. In total, 1440 densities were measured for the 1532 clusters generated and persisted over the course of testing.

Table 5. Detected anomalies

Anomalies were shown in both the independent analysis and the correlative analysis. Of particular interest is the anomaly detected in Table 5, Sample 9. This anomaly appears in all instances of the algorithm and shows significant divergence from observed normal behaviours. Once investigated, the rapid increase was attributed to a single asset that repeatedly attempted to open a connection to an external IP but never received any response. This was of interest to the security team within NPD, and was swiftly resolved. Sample 4 was also anomalous within the test. The cluster appeared in only one window, showing the extremely short-lived nature of the event; the other instances did not detect sufficient activity to generate a cluster. It represents an extremely short burst of activity to an external device.

7 Conclusions and Future Work

This paper proposed an architecture designed to detect anomalies within NetFlow data. To achieve this at a micro level, a clustering algorithm was run in parallel instances to determine anomalies in cluster activity over a time series. The architecture was shown to be able to detect anomalies in live data without any previous knowledge of the data, and these anomalies were investigated and shown to be of security interest. This result is notable given that the testing was conducted on real-world, live data, with actionable anomalies found. Future work will include refining the extensibility of the algorithm. Due to the abstraction of the density monitoring, it is possible to add attributes to MCODT's feature space in order to monitor changes not only in network traffic, but also in other metrics that could contribute to the risk of a malicious attack, such as the vulnerability of an asset as determined by an external program.