Prospects for Generative Adversarial Networks in Network Traffic Classification Tasks

The paper presents an approach that enlarges the training sample and reduces class imbalance in traffic classification problems. The basic principles and architecture of generative adversarial networks are reviewed. A mathematical model of network traffic classification is described. The training sample used for the problem is analyzed, and the data preprocessing is carried out and justified. An architecture of the generative adversarial network is constructed, and an algorithm for generating new features is developed. Machine learning models for the traffic classification problem are considered and built: logistic regression, k-nearest neighbors, decision tree, and random forest. A comparative analysis of the results of these models with and without the generation of new features is conducted. The obtained results can be applied both to network traffic classification tasks and, more generally, to multiclass classification problems with unbalanced classes.


Introduction
Currently, there are many models for solving traffic classification problems [17]. These include conventional machine learning models (random forest, CatBoost, LightGBM), clustering models, and deep learning models, i.e., neural networks. However, most models in the field of network traffic classification are trained on labeled data [18]. It is worth noting that obtaining a good, diverse, and large dataset requires considerable time and resources, which in most cases is quite problematic [19]. The main idea of this study is to use generative adversarial networks to enlarge the training sample. This significantly reduces the time needed for data acquisition, labeling, and preprocessing, and improves the quality metrics on a markedly unbalanced sample.

Generative modeling
Let there be some finite number of features X and target variables Y in the training dataset. Let z have some distribution p_z(z). Then, in the general case, there exists a function G: Z → X such that G(z) ~ p_data(x).
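The existence of such a mapping can be illustrated without neural networks: a deterministic transform of uniform noise can reproduce a target distribution. Below is a minimal illustrative sketch (not part of the paper's experiments) using the Box-Muller transform as the function G, mapping uniform noise to approximately standard-normal samples:

```python
import math
import random

def g(z1, z2):
    """Deterministic map G: (0,1)^2 -> R sending uniform noise
    to a standard-normal sample (Box-Muller transform)."""
    return math.sqrt(-2.0 * math.log(z1)) * math.cos(2.0 * math.pi * z2)

random.seed(0)
# 1.0 - random() keeps z1 in (0, 1], avoiding log(0)
samples = [g(1.0 - random.random(), random.random()) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # close to 0 and 1
```

A GAN generator plays the same role as g here, except that the mapping is learned from data rather than derived analytically.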
This fact underpinned the development of generative modeling, an unsupervised learning method in which various patterns and dependencies are detected automatically and can then be used to generate new data [7]. The new features are assumed to be as close as possible in distribution to the existing data set.
A key feature of generative adversarial networks (GAN) is the formation of a quality criterion in the form of a separate neural network [3].

Generative adversarial network
It was with the development of generative modeling that generative adversarial networks (fig. 1) emerged. They consist of two neural networks:
• Generator. This network learns to generate new features X. The generator must form the features in such a way that the discriminator cannot classify them as fake [2]; the discriminator thus serves as a quality criterion for the generator.
• Discriminator. Its main task is to distinguish the real features X from the fake ones X~. When the discriminator receives a true feature, its output is a number in the range from 0 to 1 labeled "real"; when it receives a generated feature, the output is likewise a number labeled "fake" [1]. These values can be interpreted as the discriminator's degree of confidence that the input feature belongs to the real or the generated class (fig. 2). For example, when the input is a real feature, the discriminator's output for the class "real" should tend to 1. Since the discriminator is a binary classifier (a feature is either real or generated), it is rational to use binary cross-entropy as its loss function [3]:

L_D = -(log(D(x)) + log(1 - D(G(z)))),

where D(x) is the discriminator's output for a real feature x and G(z) is a generated feature. Thus, the discriminator's output should tend to 1 for real features and to 0 for generated ones [11].
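The binary cross-entropy criterion for one real/fake pair can be written out directly. A minimal sketch (not the authors' code); `d_real` and `d_fake` stand for the discriminator's outputs D(x) and D(G(z)):

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for one real/fake pair:
    -(log D(x) + log(1 - D(G(z)))).
    Small when the discriminator outputs ~1 for real and ~0 for fake."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

confident = discriminator_loss(0.99, 0.01)  # near-perfect discriminator
confused = discriminator_loss(0.5, 0.5)     # cannot tell real from fake
print(confident < confused)  # True: the loss grows as the generator fools D
```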
The loss function for the generator can be described as:

L_G = -log(D(G(z))),

where D(G(z)) is the discriminator's output for a generated feature; the generator is thus rewarded when the discriminator classifies its output as real.

The target classes include 'Normal' traffic and several attack types. Based on the distribution of the target variable, the sample is quite unbalanced: attacks like 'Analysis', 'Fuzzers', 'Worms', 'Shellcode', 'Generic' have fewer than 700 entries each. It is for these types of attacks that it is reasonable to apply a generative adversarial network to generate new records in order to eliminate class imbalance [19].
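The selection of under-represented classes can be sketched as follows. The individual counts here are illustrative placeholders; only the class names and the 700-record threshold come from the paper:

```python
# Illustrative class counts (hypothetical numbers; only the class names
# and the 700-record threshold are taken from the text).
class_counts = {
    'Normal': 93_000,
    'Generic': 580,
    'Fuzzers': 620,
    'Analysis': 400,
    'Shellcode': 370,
    'Worms': 44,
}

THRESHOLD = 700  # classes below this receive GAN-generated records

minority_classes = sorted(c for c, n in class_counts.items() if n < THRESHOLD)
print(minority_classes)
# ['Analysis', 'Fuzzers', 'Generic', 'Shellcode', 'Worms']
```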
It is worth noting that in records where the share of missing values exceeded 40%, the missing values were replaced with median values. Since the training set contained categorical features that obviously carry useful information, they were converted to numerical ones using the One-Hot Encoding method. This method was chosen because it does not impose an artificial ordering or weighting on any of the categories when training the model.
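One-hot encoding replaces each categorical value with a binary indicator vector, so no category receives a larger numeric weight than another. A minimal sketch; the protocol column and its values are illustrative, not taken from the paper's dataset:

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector.
    Every category gets its own column, so none is implicitly
    weighted higher than another (unlike integer label encoding)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

# Hypothetical protocol feature
cats, rows = one_hot_encode(['tcp', 'udp', 'tcp', 'icmp'])
print(cats)  # ['icmp', 'tcp', 'udp']
print(rows)  # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```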

Logistic regression
The first model chosen for training was a simple logistic regression, used to determine how well the data had been preprocessed and cleaned and to examine the feature importance ranking. The regularization coefficient was 0.1, the number of epochs was 40, and the number of cross-validation folds was 10. It is worth noting that even without fine-tuning, this simple model reached about 90% accuracy (fig. 5), with an average precision-recall score of 82%. A feature importance table was constructed; Figure 6 shows the top 5 features.
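For reference, logistic regression scores a feature vector through a sigmoid. A hand-rolled sketch of the prediction step (the weights below are made up for illustration; the trained model in the paper used regularization 0.1 and 10-fold cross-validation):

```python
import math

def sigmoid(t):
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(weights, bias, x):
    """Logistic-regression probability P(y=1 | x) = sigmoid(w.x + b)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)

# Hypothetical trained parameters for two features
w, b = [0.8, -0.4], 0.1
p = predict_proba(w, b, [1.0, 2.0])
print(round(p, 3))  # 0.525
```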

K-nearest neighbors
The k-nearest neighbors method is a supervised, instance-based learning algorithm. The number of nearest neighbors was set to 3. It is worth noting that the model's cross-validation accuracy was about 67%, which is slightly better than the naive Bayes classifier (fig. 7).
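The k-nearest-neighbors prediction can be sketched in a few lines. The 2-D points and labels below are a toy illustration, not the paper's data; only k = 3 matches the text:

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), as in the k-nearest-neighbors method."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: (features, label)
train = [((0, 0), 'normal'), ((0, 1), 'normal'),
         ((5, 5), 'attack'), ((5, 6), 'attack'), ((1, 0), 'normal')]
print(knn_predict(train, (0.5, 0.5)))  # normal
print(knn_predict(train, (5, 5.5)))    # attack
```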

Decision tree
The main goal of this algorithm is to create a model that predicts the value of the target variable by learning simple decision rules derived from the data features [14]. The tree can be viewed as a piecewise constant approximation. Figure 9 shows a section of the constructed decision tree.
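A decision tree chooses its rules by impurity reduction; below is a sketch of the Gini criterion commonly used to score a candidate split (toy values and labels, not the paper's data):

```python
def gini(labels):
    """Gini impurity: 1 - sum over classes of p_c^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold;
    a tree picks the threshold that minimizes this quantity."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

values = [1, 2, 3, 10, 11, 12]
labels = ['normal'] * 3 + ['attack'] * 3
print(gini(labels))                   # 0.5 for a 50/50 mix
print(split_gini(values, labels, 3))  # 0.0: the split separates perfectly
```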

Random forest
Random forest is an ensemble of decision tree classifiers trained on different subsamples of the dataset; it uses averaging to improve prediction accuracy and control overfitting [16]. Separately, it should be noted that none of the models exceeded an accuracy of 95%. In order to increase quality criteria such as precision, recall, and accuracy, it was decided to use a generative adversarial network for the target classes 'Analysis', 'Fuzzers', 'Worms', 'Shellcode', and 'Generic'. These classes were chosen because of their small number of records in the training sample [20].
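The bagging-and-voting idea behind random forest can be sketched with one-feature decision stumps. This is a hypothetical toy illustration, not the paper's model:

```python
import random
from collections import Counter

def train_stump(xs, ys):
    """Pick the threshold that minimizes training errors for a stump
    predicting 'attack' when x > threshold."""
    best = None
    for t in xs:
        pred = ['attack' if x > t else 'normal' for x in xs]
        errs = sum(p != y for p, y in zip(pred, ys))
        if best is None or errs < best[0]:
            best = (errs, t)
    return best[1]

def random_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: each stump is trained on a bootstrap resample."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        thresholds.append(train_stump([xs[i] for i in idx],
                                      [ys[i] for i in idx]))
    return thresholds

def forest_predict(thresholds, x):
    """Majority vote over the ensemble, as in random forest."""
    votes = Counter('attack' if x > t else 'normal' for t in thresholds)
    return votes.most_common(1)[0][0]

xs = [1, 2, 3, 10, 11, 12]
ys = ['normal'] * 3 + ['attack'] * 3
forest = random_forest(xs, ys)
print(forest_predict(forest, 0), forest_predict(forest, 100))
# normal attack
```

A real random forest also subsamples features at each split; the sketch keeps only the bootstrap-plus-vote core.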
Below is the algorithm for generating new records:
1. Samples of one of the target classes were drawn one by one.
2. The record data were fed to the input of a convolutional neural network without additional grouping into batches (batch dimension one) [13].
3. A generated feature sequence was produced at the output of the neural network.
4. The random module was used to select either a real sequence of features from the dataset or a generated sequence from the generator (CNN).
5. The set of features selected at step 4 [2] was fed to the input of the discriminator.
6. A "real" / "fake" label for the sequence was formed at the output of the discriminator.
We used a convolutional neural network (CNN) as the generator [13]; its architecture is shown in Figure 11.
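Steps 4-6 above amount to coin-flipping between a real record and a generator output before each discriminator call. A toy sketch of that selection loop; the generator and discriminator here are stand-in functions, not the paper's CNN and dense network:

```python
import random

def toy_generator(rng):
    """Stand-in for the CNN generator: emits a noisy feature sequence."""
    return [rng.gauss(0.0, 1.0) for _ in range(4)]

def toy_discriminator(features):
    """Stand-in discriminator: confidence in [0, 1] that input is real.
    The scoring rule is arbitrary, purely for illustration."""
    spread = max(features) - min(features)
    return 1.0 / (1.0 + spread)

def discriminator_step(real_records, rng):
    """Steps 4-6: randomly pick a real or a generated sequence,
    score it, and return the (true label, predicted label) pair."""
    if rng.random() < 0.5:
        features, truth = rng.choice(real_records), 'real'
    else:
        features, truth = toy_generator(rng), 'fake'
    predicted = 'real' if toy_discriminator(features) >= 0.5 else 'fake'
    return truth, predicted

rng = random.Random(42)
real = [[0.1, 0.1, 0.2, 0.1], [0.0, 0.2, 0.1, 0.1]]
results = [discriminator_step(real, rng) for _ in range(10)]
print(results[0])
```

In actual GAN training, the discriminator's loss on these (truth, predicted) pairs would then be backpropagated, alternating with generator updates.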
A fully connected neural network acted as the discriminator. It consists of 8 dense layers, L1 (lasso) regularization (0.3) to reduce the probability of overfitting, and an output of dimension 1 (the real/fake label). The generated records were added to the dataset, and the models considered earlier were retrained. The table below presents a comparison of the models trained without and with the artificially generated features (via GAN) for labels with fewer than 700 records. It is clearly seen that the best model was random forest, and the use of GAN to reduce class imbalance made it possible to detect the attack types with a precision of 99%. In addition, all models increased their recall, which demonstrates the practical applicability of artificial feature generation methods in the traffic classification task.