Data set and machine learning models for the classification of network traffic originators

The widespread adoption of encryption in computer network traffic is increasing the difficulty of analyzing such traffic for security purposes. The data set presented in this data article is composed of network statistics computed on captures of TCP flows, originated by executing various network stress and web crawling tools, along with statistics of benign web browsing traffic. Furthermore, this data article describes a set of Machine Learning models, trained using the described data set, which can classify network traffic by the tool category (network stress tool, web crawler, web browser), the specific tool (e.g., Firefox), and also the tool version (e.g., Firefox 68) used to generate it. These models are compatible with the analysis of traffic with encrypted payload since statistics are evaluated only on the TCP headers of the packets. The data presented in this article can be useful to train and assess the performance of new Machine Learning models for tool classification.

Subject: Cryptography and Cybersecurity
Specific subject area: Cyber-threat and Anomaly Detection
Type of data:

Value of the Data
• These data may be used as a benchmark for developing Machine Learning models aimed at obtaining information about the tools that originated sniffed network traffic. Presently, no benchmark data are available for researchers wanting to perform this type of classification. These models are of interest for developers of security monitoring systems, such as Intrusion Detection Systems. Several types of attacks, e.g., Distributed Denial of Service and web crawling attacks, are launched using ad hoc tools. Therefore, getting information about the tools that originate the traffic can improve the detection abilities of these monitoring systems.
• These data are valuable as a data set for researchers interested in training Machine Learning models designed to obtain information about the tools that originate the sniffed traffic. Moreover, these data may serve for hyper-parameter optimization processes.
• Since several of the trained Machine Learning models are based on neural networks, these data may also be used to speed up the training of new neural networks via transfer learning.
• These data allow assessing the results of the research presented in [1], which first aimed at obtaining information about the tools that originated sniffed network traffic.

Data set
This section reports several statistics about the data set. Table 1 lists the tools used to generate the traffic considered in the presented data set. Table 2 reports the features that have been used to train and test the Machine Learning models. Finally, several statistics, grouped by label, are reported: the average number of packets and bytes sent by the client and by the server, and the average connection duration in milliseconds. Tables 3 and 4 report these averages for all the tools and for their instances in the data set, respectively.
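The per-label averages described above can be illustrated with a minimal sketch. The record layout, the field names (`client_packets`, `duration_ms`, and so on), and all the values below are hypothetical and are not taken from the data set; the sketch only shows the grouping and averaging step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flow records: field names and values are illustrative only.
flows = [
    {"label": "firefox", "client_packets": 12, "server_packets": 15,
     "client_bytes": 1800, "server_bytes": 24000, "duration_ms": 350.0},
    {"label": "firefox", "client_packets": 10, "server_packets": 17,
     "client_bytes": 2200, "server_bytes": 26000, "duration_ms": 250.0},
    {"label": "slowloris", "client_packets": 4, "server_packets": 3,
     "client_bytes": 300, "server_bytes": 0, "duration_ms": 90000.0},
]


def average_by_label(records, label_key="label"):
    """Group flow records by label and average every numeric field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[label_key]].append(record)

    averages = {}
    for label, rows in grouped.items():
        metrics = [key for key in rows[0] if key != label_key]
        averages[label] = {m: mean(row[m] for row in rows) for m in metrics}
    return averages


averages = average_by_label(flows)
for label, stats in averages.items():
    print(label, stats)
```

The same function can be applied at any labeling level (category, tool, or tool instance) by changing the key passed as `label_key`.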

Classifiers
This section reports several statistics and plots about the models for classifying the traffic into various classes. Three different models have been considered for each classification task: a random forest (via the RandomForestClassifier class in scikit-learn), an extra-trees classifier (via the ExtraTreesClassifier class in scikit-learn), and a neural network (a custom class implemented in PyTorch and skorch). The hyper-parameters were optimized with the hyperopt package using a Bayesian optimization procedure. For each classifier, the following data are reported:
• The plots showing the values of the R_k statistic as our Bayesian hyper-parameter optimization process progressed (Figs. 1-9).
• The tables listing the optimal hyper-parameters found by our Bayesian optimization process (Tables 5, 9, 13, 17, 21, 25, 29, 33, and 37); we normally used the default values for the hyper-parameters not reported.
• The tables reporting several classification statistics computed on the training set, development set, known tools test set, and unknown tools test set (Tables 6, 10, …).
• The confusion matrices for each classifier (Tables 7, 11, 15, 19, 23, 27, 31, 35, and 39).
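As an illustration of how a single score can be derived from a confusion matrix, the following sketch assumes that the R k statistic mentioned above is Gorodkin's R_K correlation coefficient, i.e., the multi-class generalization of the Matthews correlation coefficient. The function name `rk_statistic` and the confusion matrix counts are hypothetical and do not reproduce any table of this article.

```python
from math import sqrt


def rk_statistic(confusion):
    """Multi-class R_K computed from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))
    # Row sums: how often each class truly occurs.
    true_counts = [sum(confusion[k]) for k in range(n_classes)]
    # Column sums: how often each class is predicted.
    pred_counts = [sum(confusion[i][k] for i in range(n_classes))
                   for k in range(n_classes)]

    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts)))
    # By convention, return 0 when the score is undefined
    # (e.g., every sample is assigned to the same class).
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for the classes browser, crawler, and dos.
example = [
    [50, 3, 0],
    [4, 40, 1],
    [0, 2, 30],
]
print(round(rk_statistic(example), 3))
```

The score is 1 for a perfect classifier and close to 0 for predictions that are uncorrelated with the true labels, which makes it a convenient objective to track during hyper-parameter optimization on unbalanced classes.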

Category classifiers
This section reports several statistics and plots about the models for classifying the traffic into categories (e.g., browser, crawler, and dos, a.k.a. network stress tools).

Tool classifiers
This section reports several statistics and plots about the models for classifying the traffic into tools (e.g., goldeneye, hulk, firefox, wget, edge, httrack, chrome, rudy, slowloris, curl, and wpull).

Tool instance classifiers
This section reports statistics and plots about the models for classifying the traffic into tool instances (e.g.

Experimental Design, Materials and Methods
The traffic used to generate the data set has been captured using the tshark command-line interface of Wireshark 2.6.4. The web browsing part of the traffic data set has been generated by manually browsing the Internet, as reported in Section 4.1 of the main paper [1].
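The following sketch shows one possible way to turn a tshark capture into per-flow statistics based only on header information. It is not the pipeline used to build the data set: the selected fields, the helper names `tshark_command` and `flow_statistics`, the restriction to IPv4, the use of the TCP payload length (`tcp.len`) as the byte count, and the rule that the first packet of a stream comes from the client are all assumptions made for illustration.

```python
# Fields exported for every packet (standard Wireshark field names).
FIELDS = ["tcp.stream", "frame.time_epoch", "ip.src", "tcp.len"]


def tshark_command(pcap_path):
    """Build a tshark invocation printing one tab-separated row per packet."""
    command = ["tshark", "-r", pcap_path, "-Y", "ip and tcp",
               "-T", "fields", "-E", "occurrence=f"]
    for field in FIELDS:
        command += ["-e", field]
    return command


def flow_statistics(lines):
    """Aggregate per-packet rows into per-flow statistics."""
    flows = {}
    for line in lines:
        stream, timestamp, source, length = line.strip().split("\t")
        timestamp, length = float(timestamp), int(length)
        flow = flows.get(stream)
        if flow is None:
            # Assumption: the first packet seen in a stream is sent by the client.
            flow = flows[stream] = {
                "client": source, "first": timestamp, "last": timestamp,
                "client_packets": 0, "server_packets": 0,
                "client_bytes": 0, "server_bytes": 0,
            }
        side = "client" if source == flow["client"] else "server"
        flow[side + "_packets"] += 1
        flow[side + "_bytes"] += length
        flow["last"] = max(flow["last"], timestamp)

    return {
        stream: {
            "client_packets": f["client_packets"],
            "server_packets": f["server_packets"],
            "client_bytes": f["client_bytes"],
            "server_bytes": f["server_bytes"],
            "duration_ms": (f["last"] - f["first"]) * 1000.0,
        }
        for stream, f in flows.items()
    }


# Hypothetical rows in the format produced by the command above.
sample_rows = [
    "0\t100.000\t10.0.0.1\t0",
    "0\t100.100\t10.0.0.2\t0",
    "0\t100.200\t10.0.0.1\t120",
    "0\t100.250\t10.0.0.2\t1400",
    "1\t200.000\t10.0.0.1\t0",
]
stats = flow_statistics(sample_rows)
print(stats["0"])
```

On a real capture, the rows could be obtained by running the list returned by `tshark_command` with `subprocess.run(..., capture_output=True, text=True)` and splitting the standard output into lines. Since only header fields are read, the same procedure applies to flows with an encrypted payload.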