TRAFFIC ANALYSIS USING NETFLOW AND PYTHON

Abstract: This article presents an application that serves as a NetFlow collector and analyzer. It is a console application written in Python. The analyzer detects and analyzes incoming NetFlow messages of versions 1 and 5 from devices that support them. The output is a database file containing the flow information, an analysis of the overall duration (in UNIX time) of the reported traffic, and an analysis of flow lifetimes. The software is developed for Python 3 or later and is designed for the Windows operating system.


Introduction
NetFlow was invented by Cisco Systems, Inc. [1]. It is a popular and widely deployed technique. A related protocol is the Internet Protocol Flow Information Export (IPFIX), standardized by the Internet Engineering Task Force (IETF) [2]. Both are used to export flow information from routers, probes, and other network devices for security, accounting, and other purposes.
Version 1 is rarely used nowadays. Version 5 adds Border Gateway Protocol (BGP) autonomous system information and flow sequence numbers. Version 7 adds support for Cisco Catalyst switches. Version 8 is used when the Router-Based NetFlow Aggregation feature is enabled. The most recent version, 9, supports a template-based, extensible design [3].
Network devices use NetFlow to send information about passing traffic to a collector. One example of such a collector is Scrutinizer [4]. Collectors obtain information from network devices such as the duration of a connection and its Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) ports.
Flows provide a continuous account of all network activity and make it possible to detect attacks without signatures. Certain types of network attacks and other incidents can be identified, depending on the quality of the collector. Flow-based analysis relies on the algorithms used and on traffic behavior, and it provides zero-hour detection of attacks [5].
From this perspective, the deployed algorithms are crucial for identifying particular types of incidents. We therefore concentrate on developing our own algorithms for detecting malicious traffic. This development requires our own application in which we can test these algorithms.

The concept and functionality of the application
We chose the Python language for the development. It is a scripting language, similar to Matlab. Python has a large developer base and offers many packages for scientists, such as Matplotlib, NumPy, SciPy, and pandas.
Another important requirement for us is the ability to work with network interface cards. This capability is also provided by Python packages.
The name of the program is GDP [6]. It is designed to process NetFlow messages of versions 1 and 5. Both types of messages (reports) are composed of a header and a flow body. The detailed composition is presented in [7]; when designing the flow part, we adopted the bit layout from this source. Part (2) shows detailed information about each included flow, such as flow number, protocol, source IP address, destination IP address, Type of Service, and first and last time in UNIX format. An analysis of the traffic is shown in the last part (3). This analysis presents the summed duration in UNIX time for each source IP address (column T) and the number of times this connection was observed (column C).
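The header-plus-flow-body composition can be illustrated with a minimal sketch for version 5, assuming the publicly documented Cisco layout of a 24-byte header followed by 48-byte flow records. The function name parse_netflow_v5 and the dictionary keys are illustrative choices and are not taken from the GDP source code.

```python
import socket
import struct

# NetFlow v5: a 24-byte header followed by `count` flow records of 48 bytes.
# All fields are big-endian (network byte order).
HEADER_V5 = struct.Struct("!HHIIIIBBH")
RECORD_V5 = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def parse_netflow_v5(datagram):
    """Return (header, flows) for one NetFlow v5 export datagram."""
    (version, count, sys_uptime, unix_secs, _unix_nsecs,
     flow_sequence, _engine_type, _engine_id, _sampling) = HEADER_V5.unpack_from(datagram, 0)
    if version != 5:
        raise ValueError(f"unsupported NetFlow version: {version}")
    header = {
        "count": count,
        "sys_uptime": sys_uptime,      # milliseconds since the exporter booted
        "unix_secs": unix_secs,        # export time in UNIX seconds
        "flow_sequence": flow_sequence,
    }
    flows = []
    for i in range(count):
        f = RECORD_V5.unpack_from(datagram, HEADER_V5.size + i * RECORD_V5.size)
        flows.append({
            "ip_source": socket.inet_ntoa(f[0]),
            "ip_dest": socket.inet_ntoa(f[1]),
            "first": f[7],             # SysUptime at the start of the flow
            "last": f[8],              # SysUptime at the last packet of the flow
            "source_port": f[9],
            "dest_port": f[10],
            "proto": f[13],            # IP protocol number (6 = TCP, 17 = UDP)
            "tos": f[14],              # Type of Service
        })
    return header, flows
```

Version 1 messages use a shorter header and a different record layout, so a second pair of struct definitions would be needed to cover them.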

Terminal User Interface
All of this information is taken from a database file, in which all incoming traffic is written in a proper format. The string format is shown in Fig. 3. The items listed are ID, SYS_UPTIME, UNIXS, FIRST, LAST, IP_SOURCE, IP_DEST, PROTO, SOURCE_PROTO, DEST_PROTO, and timestamp. Their names come from NetFlow. The program automatically stores information in the file <dataset.sqlite3>. By default, the database file is deleted when the program starts. To preserve historical data in the database, change the value <YES> to <NO> in the configuration file.
The Terminal User Interface (TUI) is built on the npyscreen package [8]. We also had to use the threading package [9]. The program runs two separate threads that share one general lock. The TUI starts first, and the socket listener starts after it. Two independent loops are necessary because the socket listener continuously listens for incoming frames while the TUI performs separate calculations at the same time.
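The storage step can be sketched with the standard sqlite3 module. This is a minimal sketch under assumptions: the table name flows, the column types, and the delete_on_start parameter (standing in for the configuration value) are hypothetical; only the item names and the file name come from the description above.

```python
import os
import sqlite3
import time

# Item names as listed in the text; ID and the timestamp are filled in automatically.
COLUMNS = ("SYS_UPTIME", "UNIXS", "FIRST", "LAST", "IP_SOURCE",
           "IP_DEST", "PROTO", "SOURCE_PROTO", "DEST_PROTO")


def open_database(path="dataset.sqlite3", delete_on_start=True):
    """Open the flow database, removing an old file first unless told otherwise."""
    if delete_on_start and path != ":memory:" and os.path.exists(path):
        os.remove(path)
    conn = sqlite3.connect(path)
    # Column names are quoted because FIRST and LAST are SQLite keywords.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS flows ('
        '"ID" INTEGER PRIMARY KEY AUTOINCREMENT, '
        '"SYS_UPTIME" INTEGER, "UNIXS" INTEGER, "FIRST" INTEGER, "LAST" INTEGER, '
        '"IP_SOURCE" TEXT, "IP_DEST" TEXT, "PROTO" INTEGER, '
        '"SOURCE_PROTO" INTEGER, "DEST_PROTO" INTEGER, "TIMESTAMP" REAL)'
    )
    return conn


def store_flow(conn, flow):
    """Insert one flow, given as a dict keyed by the names in COLUMNS."""
    names = ", ".join(f'"{name}"' for name in COLUMNS) + ', "TIMESTAMP"'
    values = [flow[name] for name in COLUMNS] + [time.time()]
    marks = ", ".join("?" for _ in values)
    conn.execute(f"INSERT INTO flows ({names}) VALUES ({marks})", values)
    conn.commit()
```

Passing delete_on_start=False keeps the rows of earlier runs, which corresponds to preserving historical data.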
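The two-loop arrangement can be sketched with the standard socket and threading modules. This is a minimal sketch, not the GDP implementation: the npyscreen interface is replaced by a plain snapshot function that the main thread may call, and all function names, the timeout, and the default address are assumptions made for illustration.

```python
import socket
import threading


def listen(sock, received, lock, stop):
    """Listener loop: append every incoming datagram to `received` until `stop` is set."""
    sock.settimeout(0.2)  # wake up regularly so the stop flag is noticed
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(65535)
        except socket.timeout:
            continue
        with lock:  # the shared list is only touched while holding the lock
            received.append((addr[0], data))


def snapshot(received, lock):
    """Called from the interface thread: copy the shared data under the same lock."""
    with lock:
        return list(received)


def start_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket and run the listener loop in a background thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    received, lock, stop = [], threading.Lock(), threading.Event()
    thread = threading.Thread(target=listen, args=(sock, received, lock, stop), daemon=True)
    thread.start()
    return sock, received, lock, stop, thread
```

A single lock shared by both loops keeps the interface from reading the list while the listener is in the middle of appending to it.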

Flow analysis
Accumulated time duration analysis is implemented as the first type of individual analysis. Its output is presented in the previous chapter. The value is taken from the flows saved in the database and is calculated as the <SysUptime> at which the last packet of the flow was received minus the <SysUptime> at the start of the flow.

Alg. 1. Python code of duration analysis
The next step of data processing is shown in Alg. 1. The function <read_sql_query> from the pandas package reads the data from the database; the data are then limited by <head>, which reads 100 entries of IP addresses. The data are then grouped by source IP address, and the function <survival_max_time_per_ip()> returns the sum of the aggregated values. Alg. 2 presents the function <survival_read()> mentioned above. It imports the sqlite3 database module and the SQL support of pandas. As shown, the SQL query reads previously saved information from the database.
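The processing chain described above (read with read_sql_query, limit with head, group by source IP address, sum the durations) can be sketched as follows. This is a minimal sketch under assumptions: the table name flows and the function names read_flows and duration_per_source are hypothetical stand-ins and do not reproduce the bodies of <survival_read()> or <survival_max_time_per_ip()>.

```python
import sqlite3

import pandas as pd


def read_flows(conn, limit=100):
    """Read saved flows from the database and keep the first `limit` rows."""
    query = 'SELECT "IP_SOURCE", "FIRST", "LAST" FROM flows'
    return pd.read_sql_query(query, conn).head(limit)


def duration_per_source(frame):
    """Sum flow durations (T) and count observations (C) per source IP address."""
    # Duration of one flow: SysUptime at its last packet minus SysUptime at its start.
    frame = frame.assign(DURATION=frame["LAST"] - frame["FIRST"])
    return frame.groupby("IP_SOURCE")["DURATION"].agg(T="sum", C="count")
```

The result is indexed by source IP address, with the columns T and C matching the two columns of the analysis output.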

Traffic Lifespans
The second algorithm finds the lifespan of each communication and compares their similarity. Survival analysis is used for this purpose. It was originally developed to measure the lifespans of individuals, but it can be applied to the duration of any process. To estimate the survival function, we used the Kaplan-Meier estimator. The Mantel-Cox test is used to test each communication and observe its conformity. This research is presented in [10]. The test is not yet fully implemented in the final version of the GDP application.
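The Kaplan-Meier estimator multiplies, over every time at which flows ended, the factor (1 - d/n), where d is the number of flows that ended at that time and n is the number still running just before it. The following is a minimal standard-library sketch of this estimator, not the GDP code; the function name and the optional observed flags (False marks a flow whose end was not seen, that is, a censored observation) are assumptions for illustration.

```python
from collections import Counter


def kaplan_meier(durations, observed=None):
    """Return [(t, S(t))] for every time t at which at least one flow ended."""
    if observed is None:
        observed = [True] * len(durations)
    ended = Counter(t for t, seen in zip(durations, observed) if seen)
    leaving = Counter(durations)   # flows leaving the risk set at t (ended or censored)
    at_risk = len(durations)       # flows still running just before the current time
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = ended.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve
```

For example, the durations 1, 2, 2, 3 with all ends observed give survival values of 0.75, 0.25, and 0 at the times 1, 2, and 3 (up to floating-point rounding).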

Conclusion
In this paper, we presented GDP, an application we developed to collect and analyze network traffic from NetFlow messages. Its benefit is that custom algorithms can be added to the source code. The program forms a basis for our intended follow-up research. The traffic duration analysis is the first of the implemented algorithms. The second, partly implemented algorithm is the lifespan analysis. Other algorithms will follow to test the theoretical conclusions about the analytical capabilities of NetFlow reporting.
In the near future, we want to extend our application with support for NetFlow version 9 and the IPFIX format and to deploy genetic algorithms.