The Westermo network traffic data set

There is a growing body of knowledge on network intrusion detection, and several open data sets with network traffic and cyber-security threats have been released in the past decades. However, many data sets have aged, were not collected in a contemporary industrial communication system, or do not easily support research focusing on distributed anomaly detection. This paper presents the Westermo network traffic data set: 1.8 million network packets recorded in over 90 minutes in a network built from twelve hardware devices. In addition to the raw data in PCAP format, the data set also contains pre-processed data in the form of network flows in CSV files. This data set can support the research community on topics such as intrusion detection, anomaly detection, misconfiguration detection, distributed or federated artificial intelligence, and attack classification. In particular, we aim to use the data set to continue work on resource-constrained distributed artificial intelligence in edge devices. The data set contains six types of events: harmless SSH, bad SSH, misconfigured IP address, duplicated IP address, port scan, and man-in-the-middle attack.


Value of the Data
• The Westermo network traffic data set can be used for conducting research on cyber-security, in particular in the domain of Artificial Intelligence (AI) applications for network intrusion detection. A specific focus can be on different research areas within Machine Learning (ML), including application of various supervised, semi-supervised, and unsupervised ML techniques. Various ML problems can be addressed, including binary and multi-class classification, regression, clustering, and pattern recognition. The data set is particularly valuable for research on distributed and federated learning, since it was recorded at multiple locations in a distributed system. Local AI models can be deployed on clients, with a meta-model created on a server. This specific data set has three clients (left, bottom, right), and their data can be used to train local models for the clients. The local models can be merged on the server and sent back to the clients for further use.
• There are multiple existing data sets that are widely used in the network intrusion detection research area. The best-known ones include KDD99 [5], NSL-KDD [1], UNSW-NB15 [8], and CIC-IDS-2017 [9]. For all of those data sets, data about network packets were recorded and then preprocessed to create the features. Every entry was labeled either as normal activity or as some type of network attack. All of them were created in a simulated environment, containing normal traffic and different types of network attacks. The data set presented in this paper was created in the same manner, but with some crucial differences that bring novelties compared to previous data sets:
  • data collection occurred in several places in the network simultaneously,
  • in addition to cyber-security anomalies (such as port scanning), human errors (such as misconfigurations) were also included,
  • data was collected using an industrial control network,
  • preprocessed data is based on network flows instead of on individual network packets, and
  • two different labeling strategies were used.
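
The merge step for the three clients can be sketched as a sample-weighted average of local model parameters, in the style of federated averaging. This is only an illustration under stated assumptions: the text does not say which merging rule is intended, and the function name `federated_average`, the weight vectors, and the sample counts below are hypothetical.

```python
from typing import Dict, List


def federated_average(client_weights: Dict[str, List[float]],
                      client_sizes: Dict[str, int]) -> List[float]:
    """Merge local parameter vectors into one global vector.

    Each client's vector is weighted by its share of the training samples.
    This is an assumed merging rule, not one prescribed by the data set.
    """
    total = sum(client_sizes.values())
    names = list(client_weights)
    merged = [0.0] * len(client_weights[names[0]])
    for name in names:
        share = client_sizes[name] / total
        for i, weight in enumerate(client_weights[name]):
            merged[i] += share * weight
    return merged


# Hypothetical local models for the three recording nodes.
local_models = {"left": [1.0, 2.0], "bottom": [3.0, 4.0], "right": [5.0, 6.0]}
sample_counts = {"left": 100, "bottom": 200, "right": 100}

# The server merges the local models; the result would be sent back to clients.
global_model = federated_average(local_models, sample_counts)
```

Weighting by sample count lets a node that sees more traffic contribute more to the merged model; other weightings are equally possible.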

Objective
The release of this data set is motivated by several factors:
1. High value to research: Realistic industrial data is frequently requested by researchers. As far as we can tell, network traffic data sets are not often collected in multiple places in the same network topology in the same experiment, which is a setup required for the development of distributed/federated AI-powered network anomaly detection. Additionally, the data set contains extra classes that have not been considered in previously published data sets. Furthermore, the data set was collected using twelve physical devices, including industrial routers, in a network topology that mirrors an industrial network.
2. Beneficence in general: Releasing the data might do good, as new research, algorithms, or tools could be valuable not only for the research community or Westermo as a company but also for the industry and the general public.
3. Industry-academia relations: It is often said that there is a distance between academia and industry; the release of data could hopefully render researched solutions more realistic, thereby lowering thresholds for industrial adoption of research artifacts and simplifying relations between academia and industry.

Data Description
The network traffic was collected in a physical network topology constructed to be similar to an industrial communication network, see Figs. 1 and 3, as well as Table 3. With the Industrial Communication System Simulator (ICSSIM) [3], this network simulated a bottle filling factory: two programmable logic controllers (PLCs) interact with a water tank, conveyor belt, etc., to fill bottles one at a time, and one human-machine interface (HMI) presents the status to a human operator. For this purpose, twelve physical devices were involved: one laptop, six routers, and five Raspberry Pi (RPI) devices, see details in Table 3. The six routers acted as the industrial communication system (the network) and ran the widely used rapid spanning tree protocol (RSTP) for redundancy. One of the RPIs acted as the HMI of the factory simulator, and two acted as PLCs. The fourth RPI (SimFact in Fig. 1) ran the simulator for the physical world (with water tanks, etc.), and the final RPI ran the attack toolkit from ICSSIM.

Before data collection started, the system was in an operational state: the nodes in ICSSIM communicated with each other over the core network. In particular, the PLCs informed the HMI of states in the factory, e.g., the water level in the simulated tank, the states of valves, and the distances between bottles and filler. In addition, RSTP sends control traffic in the idle state.

Raw data
The raw data consists of three PCAP files of network traffic collected with tcpdump. Each entry represents a packet going into or out of one of the recording devices: left, right, or bottom, see Fig. 1. Some packets would first go into a device and then out of it, so there are many duplicated packets in the data. The physical world and factory simulator of ICSSIM were used in the data collection, and some of this traffic is not representative of a factory. For this reason, we present two sets of PCAP files: the reduced set, where the traffic needed for the physical world simulator is removed, and the extended set, where it is kept. See Table 1 for an overview of the number of packets in the PCAP files, as well as an overview of the communication protocols used.

When network events are triggered, this is described in a log file with timestamps and other information needed to make sense of the network traffic. One could say that this description contains labels for the network as a whole (not for individual packets). As an example, a MITM attack was started 748.19 seconds into the data collection and ended 781.87 seconds in. This was logged with UNIX timestamp, wall clock timestamp, and relative time in these two lines:

87] [BAD-MITM-END] done
Labels for individual packets can be inferred from the entries and timestamps used in the log file. The network events are described in Section 3.
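
One way to infer per-packet labels from event intervals is sketched below. It is a minimal illustration, assuming that events do not overlap and that each packet carries a timestamp relative to the start of the capture. Only the MITM interval (748.19 s to 781.87 s) comes from the text; the label strings, the function name `label_packets`, and the example packet times are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# (start_s, end_s, label), relative to the start of the data collection.
# Only this MITM interval is taken from the text; the label name is assumed.
EVENTS: List[Tuple[float, float, str]] = [
    (748.19, 781.87, "BAD-MITM"),
]


def label_packets(timestamps: Sequence[float],
                  events: Sequence[Tuple[float, float, str]],
                  default: str = "NORMAL") -> List[str]:
    """Return one label per packet timestamp (assumes non-overlapping events)."""
    ordered = sorted(events)
    starts = [start for start, _, _ in ordered]
    labels = []
    for ts in timestamps:
        # Index of the last event that started at or before this packet.
        idx = bisect_right(starts, ts) - 1
        if idx >= 0 and ts <= ordered[idx][1]:
            labels.append(ordered[idx][2])
        else:
            labels.append(default)
    return labels


# Hypothetical packet times, in seconds since the start of the capture.
packet_times = [10.0, 748.19, 760.5, 781.87, 800.0]
packet_labels = label_packets(packet_times, EVENTS)
```

Labeling purely by time window marks every packet recorded during an event, including unrelated background traffic, so a further filter on addresses may be needed depending on the use case.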

Data cleaning
To protect Westermo, the data set was analyzed prior to release. Some traffic that was unwanted, or that could possibly reveal details of various Westermo assets, has been removed. In order to prepare the reduced and extended data sets, traffic going to or from the SimFact node was removed. In these analysis and filtration steps, Python 3 and Scapy were used.

Network flows, processed data for machine learning
Instead of analyzing network traffic packet by packet, it is convenient to analyze it at the network flow level [6]. A flow can be defined as a set of packets with a common source, target, and protocol that are close in time, or in many other ways. In this data set, we have analyzed the network traffic with the ICSFlowGenerator tool for ICSSIM (ICSFlow) to get information on flows [2, 3]. The tool is implemented in Python with the Scapy library. It iterates through the raw PCAP data and creates CSV files with flow features.

Here, a flow is defined as a set of packets that share a source, destination, and protocol and that arrive within a tight time interval (500 ms). Network flows typically consist of packets with a given source and destination address (IP and port) and protocol. However, in our customized concept of network flow, we do not consider the ports, and we aggregate packets with the same protocol between two network addresses. The reason is that our simulation uses Modbus with fixed ports on the server side, whereas clients could use different ports, which are now aggregated.

Each network flow is characterized by a set of extracted features that can be classified into three categories: flow features, general features, and TCP features. Flow features encompass fundamental attributes such as source and destination addresses, as well as the flow's network protocols. General features provide information about the network traffic within the flow; for instance, they include the number of packets sent and received, the size of packets, the length of the flow, and the average payload of the packets. TCP features, on the other hand, are extracted only for TCP flows, and comprise details about TCP flags, TCP headers, and packet delays. For a comprehensive list of all the extracted features and their explanations, please refer to [2]. The extracted features are stored in CSV files, where each file contains 54 columns. Among these columns, 50 columns represent the various features, 4 columns represent the labels, and there may be some additional metadata columns. An overview of the counts of network flows per node and per data set can be found in Table 2.
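
The port-agnostic flow definition can be sketched as follows. This is not the ICSFlowGenerator implementation; it is a minimal illustration that assumes a flow is closed once a packet arrives more than 500 ms after the flow's first packet, and that both directions between two addresses belong to the same flow. The `Packet` record, the function `build_flows`, and the addresses are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal packet record; ports are deliberately left out.
Packet = namedtuple("Packet", "time src dst proto size")

FLOW_INTERVAL_S = 0.5  # 500 ms, as stated in the text


def build_flows(packets, interval=FLOW_INTERVAL_S):
    """Group packets into bidirectional flows keyed by address pair and protocol.

    Assumption: a new flow starts when a packet arrives more than `interval`
    seconds after the first packet of the current flow for the same key.
    """
    open_flows = {}  # key -> flow currently being filled
    finished = []
    for pkt in sorted(packets, key=lambda p: p.time):
        key = (frozenset((pkt.src, pkt.dst)), pkt.proto)
        flow = open_flows.get(key)
        if flow is None or pkt.time - flow["start"] > interval:
            if flow is not None:
                finished.append(flow)
            flow = {"src": pkt.src, "dst": pkt.dst, "proto": pkt.proto,
                    "start": pkt.time, "end": pkt.time,
                    "sent": 0, "received": 0, "bytes": 0}
            open_flows[key] = flow
        flow["end"] = pkt.time
        flow["bytes"] += pkt.size
        # "sent" counts packets in the direction of the flow's first packet.
        if pkt.src == flow["src"]:
            flow["sent"] += 1
        else:
            flow["received"] += 1
    finished.extend(open_flows.values())
    return sorted(finished, key=lambda f: f["start"])


# Hypothetical packets between made-up addresses.
packets = [
    Packet(0.00, "10.0.0.1", "10.0.0.2", "TCP", 60),
    Packet(0.10, "10.0.0.2", "10.0.0.1", "TCP", 60),
    Packet(0.40, "10.0.0.1", "10.0.0.2", "TCP", 80),
    Packet(0.70, "10.0.0.1", "10.0.0.2", "TCP", 60),
    Packet(0.20, "10.0.0.1", "10.0.0.3", "ARP", 42),
]
flows = build_flows(packets)
```

Because ports are not part of the key, several client connections to the same Modbus server collapse into one flow per interval, which matches the aggregation described above.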

Table 2
Overview of the NST label distributions of the network flows per node for the Reduced and Extended data sets. In bold, the differences between the two data sets in the label distribution.

In the CSV files, flows have been labeled with two different strategies: Injection Timing (IT) [7] and Network Security Tools (NST) [4]. For the IT strategy, we label all traffic as anomalous during an ongoing attack, whereas the NST strategy only labels traffic from or to the attacker as anomalous. The NST strategy appears to reflect the events more accurately.
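
The difference between the two labeling strategies can be sketched as below. This is an illustration only: the dictionary layout, the attacker address, and the flow values are hypothetical, and the NST rule is assumed here to apply only while the attack is ongoing.

```python
def overlaps(flow, attack):
    """True if the flow's time span intersects the attack window."""
    return flow["start"] <= attack["end"] and flow["end"] >= attack["start"]


def label_it(flow, attacks):
    """Injection Timing: every flow during an ongoing attack is anomalous."""
    return int(any(overlaps(flow, attack) for attack in attacks))


def label_nst(flow, attacks):
    """NST: only flows from or to the attacker are anomalous.

    Assumption: the rule is restricted to the attack window.
    """
    return int(any(
        overlaps(flow, attack)
        and attack["attacker"] in (flow["src"], flow["dst"])
        for attack in attacks
    ))


# The interval is the MITM example from the text; the address is made up.
attacks = [{"start": 748.19, "end": 781.87, "attacker": "10.0.0.99"}]

flows = [
    {"start": 750.0, "end": 750.4, "src": "10.0.0.99", "dst": "10.0.0.2"},
    {"start": 760.0, "end": 760.3, "src": "10.0.0.1", "dst": "10.0.0.2"},
    {"start": 800.0, "end": 800.2, "src": "10.0.0.1", "dst": "10.0.0.2"},
]

it_labels = [label_it(flow, attacks) for flow in flows]
nst_labels = [label_nst(flow, attacks) for flow in flows]
```

The second flow shows where the strategies diverge: it is ordinary traffic that happens to occur during the attack, so IT marks it as anomalous while NST does not.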

The main difference between the Reduced and the Extended data sets concerns the network flows annotated as normal and as misconfigurations. The data is imbalanced: 75% of the traffic is labeled as normal, 18.5% as misconfigurations (and bad SSH), and 6.5% as attacks (port scan and man in the middle), as can be seen in Table 2. Regarding the protocol distribution of the network flows for each node, the most frequent protocol is TCP, which is used in 70% of the cases and predominantly in the bottom node due to the PLC-to-PLC traffic, see Fig. 4. The Address Resolution Protocol (ARP) is the second most frequent protocol in the data set (18% of the total) and is used equally by each node. Other protocols appear to a lesser extent.

Concerning the extracted features for describing the network flows, a correlation heatmap is shown in Fig. 2. The Pearson pairwise correlation coefficient is computed for the most representative features. Features are also compared with a binary representation of the traffic label (normal or anomaly). Notably, the highest correlation coefficient between any feature and the binary label is around 0.2, which indicates a relatively weak linear relationship. This does not rule out nonlinear relationships between the features and the label.
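
The remark on linear versus nonlinear correlation can be illustrated with a small synthetic example. The values below are made up and are not drawn from the data set; they only show that a feature can determine a binary label exactly while its Pearson coefficient with that label is zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Synthetic feature and label: anomalous when the feature is extreme
# in either direction.
feature = [-3, -2, -1, 0, 1, 2, 3]
label = [1 if abs(x) >= 3 else 0 for x in feature]

r_raw = pearson(feature, label)                    # linear view of the feature
r_abs = pearson([abs(x) for x in feature], label)  # after a nonlinear transform
```

Here the raw feature is uncorrelated with the label, whereas its absolute value correlates strongly, which is why low Pearson coefficients alone do not imply that a feature is uninformative.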

Fig. 1. Overall network topology. Pink: A1 and A2 were used for generating anomalies. Blue: Westermo devices in the controller network. Green: Raspberry Pis running HMI or PLCs of the factory simulator. Gray: Raspberry Pi running the physical world simulated factory (SimFact). See Table 3 for details.

Fig. 2. Pearson correlation matrix of the extracted features and the binarized labels (anomaly detection) for the Reduced data set.

Fig. 3. Photo of network topology used during data collection. See Table 3 for details.

Fig. 4. Packet distributions on each node for the Reduced data set when removing RSTP packets.

Table 1
Overview of packets in data.

Table 3
Overview of hardware used.