A novel dataset for encrypted virtual private network traffic analysis

Encryption of network traffic should guarantee anonymity and prevent potential interception of information. Encrypted virtual private networks (VPNs) are designed to create special data tunnels that allow reliable transmission between networks and/or end users. However, as has been shown in a number of scientific papers, encryption alone may not be sufficient to secure data transmissions in the sense that certain information may be exposed. Our team has constructed a large dataset that contains generated encrypted network traffic data. This dataset contains a general network traffic model consisting of different types of network traffic such as web, emailing, video conferencing, video streaming, and terminal services. For the same network traffic model, data are measured for different scenarios, i.e., for data traffic through different types of VPNs and without VPNs. Additionally, the dataset contains the initial handshake of the VPN connections. The dataset can be used by various data scientists dealing with the classification of encrypted network traffic and encrypted VPNs.

© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Subject: Computer Networks and Communications
Specific subject area: Encrypted virtual private networks and their classification.
Type of data: Structured
How the data were acquired: Data was obtained by simulating real-world traffic through network traffic probes, stripped of redundant information, and organized into flows according to context. The time resolution of the packets is on the order of microseconds. The data was captured from MikroTik RouterOS routers using the open-source software pmacct. It was exported in the IPFIX [1] format into Apache Kafka, preprocessed using our own solution, ipFlowDetector, and finally exported to JSON files.

Data format: Raw
Description of data collection: The data was measured in our network laboratory under the specific conditions and scenarios described in Section 2. The data was not rearranged, but insufficient traffic flows were filtered out.

[2,3], investigate the privacy and security risks of each VPN type [4], or potentially develop VPN architectures that can bypass ISP and government restrictions.
• The dataset includes a diverse range of VPN protocols, including L2TP [5], L2TP-IPSEC [5], PPTP [6], SSTP [7], WireGuard [8], and OpenVPN [9]. Additionally, the dataset includes initial handshake flows for each VPN type, providing valuable information for further analysis. To the best of our knowledge, this is the first dataset to include multiple types of VPN flows beyond OpenVPN.

Objective
The dataset aims to allow researchers to study and compare the different VPN types and internet traffic. We included multiple VPN types that are less studied in the literature than OpenVPN, making them more visible and accessible to researchers. Because many new web services have emerged in recent years, we also used more up-to-date versions of services and websites than similar datasets do.
The most comparable dataset, ISCXVPN2016 [10], contains a limited variety of VPNs and outdated traffic content (the content used to generate the data is more than 6 years old). Newer datasets such as [4] and [11] use only OpenVPN, and their data is not openly available to researchers, which makes our dataset an important asset for scientists.

We divided the generated traffic into five types:
• Non-streaming: HTTP/HTTPS traffic from websites that do not contain streaming content such as video and audio. Example websites are www.google.com or www.github.com.
• Streaming: HTTP/HTTPS traffic from websites that contain streaming content, like YouTube and Twitch.
• Email: Traffic generated from delivering emails.
• VoIP: Traffic generated from videoconferencing services such as Google Meet.
• SSH: Traffic generated from connecting to remote servers using Secure Shell protocol.
In addition to the flows of the generated traffic, we also included the first flows of each VPN's initial connection.
Table 1 demonstrates that the dataset contains a substantial number of flows and is varied among different types of VPNs. Fig. 1 illustrates the distribution of VPN flows in the dataset, with each slice of the pie chart representing the percentage of the total dataset flows that belongs to a particular VPN type. Fig. 2 shows the same distribution for the traffic types.
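The percentages plotted in such pie charts are simple shares of the total flow count. A minimal sketch of that arithmetic follows; the function name `flow_shares` and the counts in the usage note are illustrative assumptions, not values taken from Table 1.

```python
def flow_shares(flow_counts):
    """Convert a mapping of label -> number of flows into label -> percentage.

    The labels can be VPN types (as in Fig. 1) or traffic types (as in Fig. 2).
    """
    total = sum(flow_counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty dataset")
    return {label: 100.0 * count / total for label, count in flow_counts.items()}
```

For example, `flow_shares({"OpenVPN": 1, "SSTP": 3})` assigns 25% of the flows to the first label and 75% to the second; these counts are made up purely for illustration.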
The dataset is stored in the JSON format, which is human-readable and supported by modern programming languages. At the top level, the dataset is split into two folders: the first contains non-VPN flows, and the second contains VPN flows. The VPN folder contains six subfolders, one for each type of VPN. The non-VPN folder contains five traffic JSON files, while each of the VPN folders has five traffic JSON files plus a JSON file containing the first flows captured when establishing the VPN connection. Fig. 3 shows how the files are organized in the dataset, and Fig. 4 shows the sizes of the dataset by VPN and traffic type.
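Because every file is a JSON array of flows, the folder tree can be traversed with a few lines of standard-library Python. The sketch below is one possible way to do this; the helper names `load_flows` and `count_flows` are hypothetical, and the code makes no assumption about the actual folder or file names beyond the `.json` extension.

```python
import json
from pathlib import Path


def load_flows(json_path):
    """Return the list of flow objects stored in one dataset JSON file."""
    with open(json_path, "r", encoding="utf-8") as handle:
        flows = json.load(handle)
    if not isinstance(flows, list):
        raise ValueError(f"{json_path} does not contain a top-level array")
    return flows


def count_flows(dataset_root):
    """Map each JSON file (path relative to the root) to its number of flows."""
    root = Path(dataset_root)
    counts = {}
    for json_path in sorted(root.rglob("*.json")):
        relative_name = json_path.relative_to(root).as_posix()
        counts[relative_name] = len(load_flows(json_path))
    return counts
```

Note that `load_flows` reads a whole file into memory, which may be impractical for the largest files; a streaming JSON parser would be a reasonable alternative in that case.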

Each JSON file contains an array of flows. A flow is represented as an object that stores its principal information, such as the protocol name and the ports used. The flow object is described in Table 2.

Table 2
The description of the flow object attributes.

Every flow object contains an array of the packets captured during that flow. Each packet is represented as a JSON object. The attributes present in a packet may differ from one flow protocol to another. The packet object is described in Table 3.
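According to Table 3, the sign of the `bytes` attribute of a packet encodes its direction: a positive value marks the forward direction, and any other value marks the backward direction. The sketch below shows how a list of packet objects could be summarized per direction. It assumes that backward packets are stored with a negative `bytes` value whose magnitude is the payload size; the function names are hypothetical, and the code takes the packet list directly because the name of the flow attribute that holds it is not restated here.

```python
def split_by_direction(packets):
    """Split packet objects into (forward, backward) lists using the sign of bytes."""
    forward = [packet for packet in packets if packet["bytes"] > 0]
    backward = [packet for packet in packets if packet["bytes"] <= 0]
    return forward, backward


def direction_summary(packets):
    """Count packets and payload bytes separately for each direction."""
    forward, backward = split_by_direction(packets)
    return {
        "forward_packets": len(forward),
        "backward_packets": len(backward),
        "forward_bytes": sum(packet["bytes"] for packet in forward),
        # Assumption: the magnitude of a non-positive value is the payload size.
        "backward_bytes": sum(abs(packet["bytes"]) for packet in backward),
    }
```

Per-direction packet counts and byte totals of this kind are common input features for encrypted-traffic classifiers.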

Experimental Design, Materials and Methods
In this section, we describe the environment used for establishing VPN connections and collecting VPN and non-VPN network flows (Section 2.1), then we provide an overview of the data acquisition process (Section 2.2).

The Data Measurement Scheme
Fig. 5 shows the environment used for the data acquisition. The scheme consists of five main components used for flow generation, VPN connection establishment, and capturing/filtering the network flows. The roles and specifications of each component are as follows:
• Virtual Machine 0 (VM0): An Ubuntu 20.04 LTS virtual machine whose purpose is to generate web traffic and store the captured flows. It receives the captured flows from the Probe via Client MikroTik and saves them. It also exchanges traffic with Client MikroTik.
• Client MikroTik: A MikroTik RouterOS virtual machine that plays the role of the client in the VPN mode and links the Router and VM0. The VPN type is set manually in this VM.
• Server MikroTik: A MikroTik RouterOS virtual machine that plays the role of the server in the VPN mode and links the Router and the internet. The VPN type is set manually in this VM.
• Router: A physical router hosted in the university laboratory. It links the Client and the Server MikroTik virtual machines and sends the passing packets to the Probe.
• Probe: A physical computer that captures the mirrored traffic coming from the Router, converts the traffic into the IPFIX format, and uploads the IPFIX records to a data storage.

Table 3
The description of the packet object attributes.

| Attribute Name | Description |
| bytes | The size of the payload of the packet. A positive value means that the packet travelled in the forward direction; otherwise, the packet travelled in the backward direction |
| timestamp_start | The start timestamp of the captured packet |
| timestamp_end | The end timestamp of the captured packet |
| packets | The number of packets captured during the capturing timestamp |
| ip_header_len | The length of the IP header |
| tcp_header_len | The length of the TCP header |
| tcp_ack_number | The TCP acknowledgement number |
| tcp_flags | The TCP flags |
| tcp_seq_number | The TCP sequence number |
MikroTik RouterOS already includes the configurations of all the VPNs used. In the non-VPN setup, we disabled all of the VPN configurations and routed the traffic from VM0 directly through the Router.
The traffic captured by the Probe is preprocessed and filtered using ipFlowDetector, a program that we wrote in C++ for efficiency. The resulting flows are then exported into JSON files and stored in VM0. Later, the IP addresses in the JSON files are anonymized and broadcast flows are filtered out.
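The text does not specify how the anonymization and broadcast filtering were carried out. As an illustration only, the sketch below shows one common approach: replacing each address with a keyed-hash pseudonym (HMAC-SHA256) so that the same address always maps to the same token, and dropping flows addressed to the limited broadcast address. The field names `src_ip` and `dst_ip`, the function names, and the choice of technique are all assumptions and are not taken from ipFlowDetector or the dataset.

```python
import hashlib
import hmac

LIMITED_BROADCAST = "255.255.255.255"


def pseudonymize_ip(address, secret_key):
    """Map an IP address string to a stable, non-reversible token."""
    digest = hmac.new(secret_key, address.encode("utf-8"), hashlib.sha256)
    return "ip-" + digest.hexdigest()[:16]


def anonymize_flows(flows, secret_key, src_field="src_ip", dst_field="dst_ip"):
    """Drop limited-broadcast flows and pseudonymize addresses in the rest.

    The field names are hypothetical defaults; the input flows are not modified.
    """
    cleaned = []
    for flow in flows:
        if flow.get(dst_field) == LIMITED_BROADCAST:
            continue  # skip broadcast flows
        anonymized = dict(flow)  # shallow copy, keeps all other attributes
        for field in (src_field, dst_field):
            if field in anonymized:
                anonymized[field] = pseudonymize_ip(anonymized[field], secret_key)
        cleaned.append(anonymized)
    return cleaned
```

This sketch only recognizes the limited broadcast address; detecting subnet-directed broadcasts would additionally require the prefix length of each network.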

Traffic Generation
In our work, we divided the generated traffic into five types: streaming, non-streaming, mail, VoIP, and SSH (refer to Section 1 for a description of each type). The choice of this classification and the share of each type were mainly based on our intuition, because there are few publications on the distribution of traffic types in the real world [14].
To automate the traffic generation process, we created shell and Python scripts. Each Python script automates one traffic type, while the shell script contains the sequence of commands that runs the ipFlowDetector program and the Python scripts across the different VPN and traffic types. The details of the automation of each traffic type are as follows:
• Non-streaming: The Selenium library and Google Chrome version 104 were used. We collected a list of 1022 URLs of websites that do not contain streaming content, such as Wikipedia and Pinterest. The script opens the websites sequentially, waits for each page to load, stays on the page for a short duration, and then moves to the next website.
• Streaming: The Selenium library for Python and Google Chrome version 104 were used. We collected a list of 105 streaming content items, mostly from YouTube; the rest are from Twitch, SoundCloud, and other streaming services. As in the non-streaming case, the script opens the 105 websites sequentially but stays on each of them for a longer time.
• VoIP: The Selenium library for Python and Google Chrome version 104 were used. We used Google Meet (voice and video) with a simulated camera on the side of VM0.
• Mail: Multiple emails were sent using the redmail library and Outlook.
• SSH: The script connected to a remote terminal and executed a list of commands multiple times using the spur library.
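The sequential browsing loop described for the non-streaming and streaming types can be sketched as follows. This is not the authors' script: the function name `visit_sequentially` and its parameters are hypothetical, and the browser is passed in as an argument so that the loop works with any object exposing a `get(url)` method, such as a Selenium WebDriver.

```python
import time


def visit_sequentially(driver, urls, dwell_seconds, sleep=time.sleep):
    """Open each URL in order and stay on the page for dwell_seconds.

    With Selenium, driver.get(url) returns once the page has fired its load
    event, so the dwell time starts after the initial page load.
    """
    visited = []
    for url in urls:
        driver.get(url)
        sleep(dwell_seconds)  # short for non-streaming, longer for streaming
        visited.append(url)
    return visited


# Possible usage with Selenium (requires Chrome and a matching driver):
#
#     from selenium import webdriver
#
#     driver = webdriver.Chrome()
#     try:
#         visit_sequentially(driver, ["https://www.wikipedia.org"], dwell_seconds=5)
#     finally:
#         driver.quit()
```

Keeping the dwell time as a parameter lets the same loop serve both traffic types, with only the URL list and the time spent on each page changing.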
In the VPN mode, ipFlowDetector captured the initial flows of each VPN connection establishment and saved them in initial-flows.json. The motivation for including these flows in the dataset is that OpenVPN handshakes have been used as a VPN fingerprinting method [4], and it can be useful for researchers to investigate the handshakes of other VPNs. After establishing the VPN connection, we started capturing the flows of the five types of traffic.

Fig. 1. Pie chart of the distribution of the flows for each VPN type. Each slice of the pie chart represents the percentage of the total dataset flows that belongs to a particular VPN type.

Fig. 2. Pie chart of the distribution of the flows for each traffic type. Each slice of the pie chart represents the percentage of the total dataset flows that belongs to a particular traffic type.

Fig. 3. The structure of folders and files in the dataset.

Fig. 4. The size of the dataset by VPN and traffic type.

Fig. 5. The topology of the network items used for the traffic generation and capturing. All items are described in detail in the list above.

Table 1
The number of flows and the size of the dataset for each VPN traffic type (without counting the initial flows).

The dataset consists of labeled network traffic. The traffic is either VPN traffic or non-VPN traffic. The VPN traffic is generated via a set of different VPN types:
• PPTP: Point-to-Point Tunneling Protocol, built by Microsoft. It operates at Layer 2 of the OSI model [12] and offers an insufficient level of encryption [13].
• L2TP: Layer 2 Tunneling Protocol (in the OSI model), without data encryption or strong authentication.
• L2TP-IPSEC: Layer 2 Tunneling Protocol, encrypted using the IPSEC protocol (NAT-Traversal mode).
• SSTP: Secure Socket Tunneling Protocol, which is based on HTTPS and operates at the application layer of the OSI model.
• WireGuard: A modern, open-source VPN protocol that operates on layer 7, utilizes state-of-the-art cryptography techniques, and uses UDP as its transport protocol.
• OpenVPN: A modern, popular, open-source protocol that operates on layer 7. It is used primarily for end-user connections.