Improving the quality of generative models through Smirnov transformation


Introduction
In many domains, applying machine and deep learning (MDL) techniques requires considerable amounts of data to take advantage of their powerful learning processes. In addition, MDL algorithms must account for the evolution of data patterns over time, which implies periodically producing additional volumes of relevant data to train new MDL models. Producing the required volumes of data in industrial scenarios is usually a considerable obstacle, since data gathering and processing tasks tend to be optimized to guarantee services and billing. Even if efficient mechanisms for generating and labelling data sets can be implemented, data are increasingly protected by the legal regulations that governments impose to guarantee the privacy of their contents (e.g., the European General Data Protection Regulation (GDPR)). These restrictions may discourage the use of real data sets for MDL training and validation purposes.
To address this problem, in the last decade Generative Adversarial Networks (GANs) [9] have gained significant attention due to their ability to estimate the underlying statistical structure of high-dimensional data and to generate synthetic data simulating realistic media such as images, text, audio and video [35,8,11,22]. GANs are now broadly studied and applied in academic and industrial research in domains beyond media (e.g., natural language processing, medicine, electronics, networking, and cybersecurity).
An increasing number of industrial applications need a mechanism to generate high-quality synthetic data that can fully replace real data in machine learning tasks, avoiding the privacy violations that might appear when real data are used for training and testing. Therefore, a mechanism to generate high-quality labelled data sets that do not incur privacy breaches will foster cross-development of MDL components by third parties. For example, a telecom provider developing ML-based components for an intrusion detection system (IDS) can import high-quality synthetic data from a telecom operator to train and validate these components. As the synthetic data were generated from real data using GANs, the ML component will reach the desired level of performance after training and, furthermore, no breach of data privacy arises, since the telecom operator does not share any real data with the telecom provider.

Contribution
The main contribution of this work is to introduce a novel application of custom activation functions to the last layer of GAN generators to obtain synthetic data that replicates with high fidelity the statistical behaviour of real data. We provide evidence of the robustness of this solution in several scenarios, including when dealing with data that contain discrete features (i.e., variables that follow a discrete distribution). The proposed custom activation function is based on the Smirnov transform (ST) and allows the GAN generator to perfectly mimic any kind of real data distribution.
Roughly speaking, given an initial random variable $Y$ and a target random variable $X$, the Smirnov transformation is a real-valued function $S_{Y \to X} : \mathbb{R} \to \mathbb{R}$ such that the transformed random variable $S_{Y \to X}(Y)$ distributes as $X$. In other words, $S_{Y \to X}$ is a function that 'bends' the shape of the distribution of the input random variable $Y$ and turns it into the distribution of the output $X$. Notice that the function $S_{Y \to X}$ is completely deterministic, with no stochastic behaviour, and moreover differentiable, which allows it to be seamlessly integrated in the backpropagation computations during the GAN training process.
We apply this transformation to the output of a GAN. Empirically, it can be observed that the outputs of a generative network that uses linear activation in the units of the last layer tend to follow a normal-like distribution (e.g. unimodal, with non-compact support, approximately symmetric, etc.). This behaviour discourages its use when the variables to be replicated are discrete or follow complex continuous distributions. To bend this output distribution to fit the actual distribution to be generated, we propose to use the Smirnov transform $S_{N \to X}$ as the activation function in the units of the last layer of the GAN, converting each normal distribution into the target distribution $X$. This function $S_{N \to X}$ can be computed per output unit beforehand just by analyzing the training data set, so that it is fixed before the training process of the GAN starts.
To show the effectiveness and utility of this proposal, we have conducted the following analyses and obtained the following conclusions:
1. We empirically demonstrate that the data generated with our solution are of high quality, comparable with the original data. This generative power is tested on two different data sets: a) a rendered data set containing a mixture of discrete and continuous variables, and b) a real data set of flow-based network traffic containing benign user connections and cryptomining attacks.
2. We provide empirical evidence of the quality improvement of our ST-based GANs with respect to standard GANs and simple mean-based generators. Data fidelity is assessed in two different ways: a) from a statistical perspective, we use two distance functions (based on the $L^1$ distance and the Jaccard index) that allow us to quantitatively and graphically compare the evolution of the quality of the data generated by ST-based and standard GANs during the GAN training; and b) from a practical viewpoint, we test the performance of a nested Machine Learning (ML) classifier when the synthetic data generated by ST-based and standard GANs completely replace real data in the training of the classifier, comparing the performance of both classifiers when they are tested with real data.
3. The results obtained in these tests clearly outperform the existing solutions in the literature. Our proposal generates synthetic samples of such high quality that they can completely replace real data without degrading the performance of the nested model. To the best of our knowledge, this is the first work in which this goal is achieved outside the image generation domain.
4. Due to the ill-convergence of GAN training, we do not adopt the usual stopping criterion of halting the training after a fixed number of epochs. Instead, we introduce a novel approach based on the observed performance of a nested ML task that uses the synthetic data produced by the generator at each epoch.
Beyond the high quality of the obtained data itself, this approach has several relevant collateral consequences for the GAN generation procedure, among which we highlight the following:
1. Our solution provides a more general approach that deals not only with the replication of categorical variables but with any type of data distribution (continuous or discrete). To convert a standard GAN into an ST-based GAN we only need to replace the activation functions of the last-layer units of the generator with the corresponding IST functions. The IST activation functions are pre-computed before the GAN training starts and therefore the additional computational effort can be considered negligible in the context of a full GAN training. In addition, when categorical or discrete variables are considered, it is not necessary to adapt the architecture of the generative network to the structure of the data, as happens in the existing solutions. On the contrary, we only need to replace the activation function of each output neuron in the generator with the corresponding ST activation, independently of the number of categories of the replicated variable.
2. The proposed ST-based activation function is differentiable and therefore can be seamlessly integrated in the backpropagation computations during the GAN training process.
3. The generation procedure avoids the privacy violations that could appear when real data is used for different tasks (e.g., in MDL training and testing processes, and when sharing data in cross-developments or federated environments).
4. The generative power of our solution can be used to alleviate the existing shortage problem of available (labelled) data sets in many domains.

Brief review of GANs
Let $\Omega$ be a probability space and let us consider a random vector $X : \Omega \to \mathbb{R}^n$. Typically, neither the space $\Omega$ nor the variable $X$ is fully known, and only very limited access to them is provided, for instance, through a collection of samples. The aim of a generative model is to 'replicate' $X$ as reliably as possible, while admitting certain variability. In other words, the goal is the following: given a fixed probability space $\Lambda$, called the latent space, create a random vector $G : \Lambda \to \mathbb{R}^n$ such that $G$ is 'near' to $X$ in some sense (typically, in distribution).
The proposal of a Generative Adversarial Network (GAN) is to tune two different functions: a generator $G : \Lambda \to \mathbb{R}^n$ and a discriminator $D : \mathbb{R}^n \to \mathbb{R}$. The generator $G$ will be trained to generate samples as faithful to $X$ as possible, while the discriminator $D$ is a binary classifier optimized for distinguishing between 'real' samples of $X$ and 'fake' samples generated by $G$. Throughout this paper, we shall adopt the convention that the output of $D$ is the likelihood of a sample $x$ being real, measured in logits. In other words, if $\sigma(t) = (1 + e^{-t})^{-1}$ is the logistic function, then $\sigma(D(x)) = 1$ means that $D$ believes that $x$ is a real sample, whereas $\sigma(D(x)) = 0$ stands for $D$ believing that $x$ is a fake sample.
The classification error suffered by the classifier $D$ is thus where $\mathbb{E}_\Omega$ and $\mathbb{E}_\Lambda$ denote the mathematical expectation on $\Omega$ and $\Lambda$, respectively. It is customary to weight this error with an increasing concave function $f : \mathbb{R} \to \mathbb{R}$ so that we instead consider the error function of $D$ (1) Typical choices for $f$ are $f(s) = \log(\sigma(s))$ as in [9], or $f(s) = s$ as in the Wasserstein GAN (WGAN) [2].
To improve this error, we suppose that both $G$ and $D$ depend on some parameters (usually, they are implemented as neural networks) and we adjust these parameters to optimize the error $E$. However, notice that the parameters of the discriminator $D$ are trained to minimize $E$, whereas the generator $G$ aims to cheat $D$, so it seeks to maximize $E$. This gives rise to the competitive game
$$\min_D \max_G E(G, D). \quad (2)$$
Beware of the sign conventions. Sometimes in the literature, the error function (1) is weighted with the decreasing function $f(-s)$, so the resulting cost function is equivalent to the one presented here but the objectives of $G$ and $D$ are exchanged: $G$ aims to minimize it while $D$ tries to maximize it.
In this way, the goal of the training process of a GAN system is to look for Nash equilibria of (2). These are pairs $(G_0, D_0)$ of a generator and a discriminator such that the function $G \mapsto E(G, D_0)$ has a local maximum at $G = G_0$ and the function $D \mapsto E(G_0, D)$ has a local minimum at $D = D_0$. Nash equilibria exhibit very good theoretical properties of a probabilistic nature. For instance, under the assumption that a perfect discriminator is reachable at a Nash equilibrium, the Jensen-Shannon divergence between the synthetic distribution $G$ and the original one $X$ is minimized. Similarly, Wasserstein's Earth-moving distance is minimized when a WGAN is applied [2].
Despite that, the problem of finding Nash equilibria in GANs is still essentially open. The seminal paper [9] proposes a simple optimization procedure based on alternating gradient descent. However, as pointed out in [18], the game (2) to be optimized is not a convex-concave problem, so in general the convergence of these simple training methods is not guaranteed. To stabilize this training process, several heuristic methods have been proposed, such as feature matching, minibatch discrimination, or semi-supervised training [28]; the introduction of spurious noise [30,1]; or the application of regularization methods based on gradient penalty [27].
Finally, it is worth mentioning that most of these training methods have been designed to be applied to the case in which the input data X are graphical images. In this scenario, the statistical properties of the color distribution among the pixels foster the convergence of the GAN, which may achieve high quality results. Nevertheless, when the data to be replicated exhibit different statistical properties (say, categorical features, heavy tailed non-normal distribution, strong domain restrictions...), to the best of our knowledge no general purpose training method is known and the results are typically very poor [17].

The Smirnov transform
Borrowed from the field of theoretical probability, there is a mathematical transformation that will be very useful for our purposes. In this section, we shall outline the main concepts and properties of this map. For further information, please refer to [26] or [7].
Suppose that $X : \Omega \to \mathbb{R}$ is a random variable and let $F_X(x) = P(X \le x)$ be its cumulative distribution function. If $F_X : \mathbb{R} \to [0, 1]$ is a continuous increasing function, then its inverse $F_X^{-1} : [0, 1] \to \mathbb{R}$ is a well-defined function called the quantile function. Otherwise, we can still define an analogue of the quantile function by setting
$$F_X^{-1}(p) = \inf \{x \in \mathbb{R} \mid F_X(x) \ge p\}. \quad (3)$$
Recall that $F_X$ is non-decreasing and right continuous, so the preimage $F_X^{-1}([p, \infty)) = \{x \mid F_X(x) \ge p\}$ is actually a final segment including the leftmost endpoint. The value $F_X^{-1}(p)$ is thus nothing but the infimum of this set, which is actually a minimum by right continuity. A key feature of the quantile function is the following result.
Proposition 3.1. Let $U$ be a uniform random variable on $[0, 1]$. Then the random variable $F_X^{-1}(U)$ distributes as $X$.
Proof. This is a very well-known result whose proof can be found, for instance, in [7]. We include here a brief proof for the convenience of the reader. Notice that, for any $x \in \mathbb{R}$ and $0 \le p \le 1$, we have that $\min \{z \mid F_X(z) \ge p\} \le x$ if and only if $x \in \{z \mid F_X(z) \ge p\}$, which means that $p \le F_X(x)$. Therefore, we get that, for all $x \in \mathbb{R}$,
$$P(F_X^{-1}(U) \le x) = P(U \le F_X(x)) = F_X(x).$$
Thus, the cumulative distribution function of the random variable $F_X^{-1}(U)$ coincides with $F_X$, as we wanted to show.
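As an illustration of the generalized quantile function and of the sampling result above, the following Python sketch draws from a discrete distribution by applying the quantile function to uniform noise. The target distribution (`values`, `probs`) and the helper name `quantile` are hypothetical choices made only for this example.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

# Hypothetical discrete target X: its values and their probabilities.
values = [0, 1, 5]
probs = [0.2, 0.5, 0.3]
cdf = list(accumulate(probs))  # F_X at each value, approximately [0.2, 0.7, 1.0]

def quantile(p):
    """Generalized quantile: the smallest value x with F_X(x) >= p."""
    # bisect_left returns the first index i such that cdf[i] >= p.
    i = bisect.bisect_left(cdf, p)
    # Clip the index in case floating-point rounding leaves cdf[-1] slightly below 1.
    return values[min(i, len(values) - 1)]

rng = random.Random(0)
draws = [quantile(rng.random()) for _ in range(100_000)]
freq = {v: c / len(draws) for v, c in Counter(draws).items()}
```

The observed frequencies in `freq` should be close to `probs`, even though $F_X$ is a step function and has no inverse in the usual sense.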
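There is also a partial converse, but it requires imposing extra assumptions on $F_X$.
Proposition 3.2. Let $X$ be a random variable whose cumulative distribution function $F_X$ is continuous and increasing. Then $F_X(X)$ distributes as a uniform random variable on $[0, 1]$.
Proof. Again, the proof can be found in [7]. Since $F_X$ is increasing and continuous, its pointwise inverse is well defined. Hence, for all $x \in [0, 1]$, we have
$$P(F_X(X) \le x) = P(X \le F_X^{-1}(x)) = F_X(F_X^{-1}(x)) = x.$$
Thus, $P(F_X(X) \le x) = x$ for all $x \in [0, 1]$, while $P(F_X(X) \le x) = 0$ for $x < 0$ and $P(F_X(X) \le x) = 1$ for $x > 1$, which shows that $F_X(X)$ distributes as a uniform random variable.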
Some remarks are in order. In the previous proposition, the hypothesis that $F_X$ is invertible is actually too strong. Repeating the type of arguments of Proposition 3.1, it can be shown that $F_X(X) \sim U[0, 1]$ provided that $P(X = x) = 0$ for any $x$ in the support of $X$. This holds, for instance, for all continuous random variables. In the case that $F_X$ is discontinuous, a closed formula for the cumulative distribution function of $F_X(X)$ can still be obtained. Indeed, if $x \in \mathbb{R}$ is a continuity point, then $P(F_X(X) \le x) = x$ as usual; but if $x \in \mathbb{R}$ lies in the middle of a jump singularity in which $F_X$ jumps to a value $x^+ > x$ (in other words, $x^+ = F_X^{-1}(F_X(x))$), then we have that $P(F_X(X) \le x) = x^+$.
Propositions 3.1 and 3.2 allow us to transform a random variable into any other desired distribution. We state this as a theorem since it is crucial for our later developments. The proof is just a straightforward combination of the aforementioned results.
Theorem 3.3. Let $X$ be an arbitrary random variable and let $Y$ be a continuous random variable. Let $F_X$ and $F_Y$ be the cumulative distribution functions of $X$ and $Y$, respectively. Then, we have that
$$S_{Y \to X}(Y) = F_X^{-1}(F_Y(Y))$$
is a random variable that distributes as $X$.
Notice that Proposition 3.1 does not require any assumption on the distribution of $X$, so the target distribution may be anything. However, to apply Proposition 3.2, we need the original variable $Y$ to be continuous. Recall that, in the case that $F_X$ is not continuous or increasing, the quantile function $F_X^{-1}$ is defined as in (3).
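To make the statement of Theorem 3.3 concrete, here is a minimal Python sketch for one case where both functions are available in closed form: the source $Y$ is standard normal and the target $X$ is exponential, whose quantile function is $-\ln(1 - p)/\text{rate}$. The exponential target and the function name `smirnov_normal_to_exp` are assumptions chosen for illustration; they do not come from the experiments of this work.

```python
import math
import random
from statistics import NormalDist, mean

std_normal = NormalDist(0.0, 1.0)

def smirnov_normal_to_exp(y, rate=1.0):
    """S_{Y->X}(y) = F_X^{-1}(F_Y(y)) for Y ~ N(0, 1) and X ~ Exp(rate)."""
    p = std_normal.cdf(y)           # F_Y(y): uniform on [0, 1] when y is drawn from N(0, 1)
    return -math.log1p(-p) / rate   # quantile function of the exponential distribution

rng = random.Random(0)
ys = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
xs = [smirnov_normal_to_exp(y) for y in ys]
```

The transformed samples `xs` are non-negative and their mean is close to `1 / rate`, as expected for an exponential variable. For extremely large inputs the normal CDF rounds to 1 and the logarithm is undefined, so a practical implementation would clip `p` slightly below 1.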

Empirical estimation of the Smirnov transform
Let us consider the scenario of Theorem 3.3, in which we want to transform a random variable Y into another random variable X. Typically, the distribution of Y will be known, but the actual distribution of X might be unclear (for instance, because it is a very involved phenomenon).
To address this issue, we propose to estimate it through the so-called empirical cumulative distribution function, denoted by $\hat{F}_X$. For this purpose, we assume access to a collection of samples $x_1, \ldots, x_m$ of $X$, and we define $\hat{F}_X$ by
$$\hat{F}_X(x) = \frac{1}{m} \sum_{i=1}^{m} \chi_{(-\infty, x]}(x_i),$$
where $\chi_A$ is the characteristic function of the set $A$, that is, $\chi_A(x) = 1$ if $x \in A$ and $\chi_A(x) = 0$ otherwise.
By the Glivenko-Cantelli theorem, when $m \to \infty$, the empirical cumulative distribution function $\hat{F}_X$ converges to the real distribution function $F_X$ in the $L^\infty$ distance almost surely. This means that, for large $m$, $\hat{F}_X$ is a very good estimator of $F_X$. In particular, we can approximate the Smirnov transform $S_{Y \to X}(Y)$ by the empirical Smirnov transform
$$\hat{S}_{Y \to X}(Y) = \hat{F}_X^{-1}(F_Y(Y)).$$
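The empirical construction can be sketched in a few lines of standard-library Python. The helper names (`make_ecdf`, `make_empirical_quantile`, `make_empirical_smirnov`) and the toy discrete target are hypothetical; the empirical quantile is taken as the smallest sample value whose empirical CDF reaches $p$, which is the order statistic of rank $\lceil p\,m \rceil$.

```python
import bisect
import math
import random
from statistics import NormalDist

def make_ecdf(samples):
    """Return F_hat(x) = (1/m) * #{i : x_i <= x}."""
    xs = sorted(samples)
    m = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / m

def make_empirical_quantile(samples):
    """Return F_hat^{-1}(p) = min{x : F_hat(x) >= p}."""
    xs = sorted(samples)
    m = len(xs)
    def quantile(p):
        rank = math.ceil(p * m)              # 1-based rank of the order statistic
        return xs[min(max(rank, 1), m) - 1]  # clip so that p = 0 and p = 1 are handled
    return quantile

def make_empirical_smirnov(samples, source=NormalDist(0.0, 1.0)):
    """Return S_hat(y) = F_hat_X^{-1}(F_Y(y)) for a continuous source variable Y."""
    quantile = make_empirical_quantile(samples)
    return lambda y: quantile(source.cdf(y))

rng = random.Random(0)
# Hypothetical samples of a discrete target X.
target = rng.choices([0, 1, 5], weights=[0.2, 0.5, 0.3], k=20_000)
s_hat = make_empirical_smirnov(target)
synthetic = [s_hat(rng.gauss(0.0, 1.0)) for _ in range(20_000)]
```

Every value in `synthetic` is one of the observed sample values, and their relative frequencies approximate those of `target`, although the input noise is Gaussian.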

Smirnov transform as GAN activation
After this probabilistic digression, let us come back to the problem of training GANs. Suppose that we want to generate a random vector $X = (X_1, \ldots, X_n)$ of which only some samples $x_1 = (x_1^1, \ldots, x_1^n), \ldots, x_m = (x_m^1, \ldots, x_m^n) \in \mathbb{R}^n$ are known. For this purpose, we want to train a GAN made of a generator $G = (G_1, \ldots, G_n) : \Lambda \to \mathbb{R}^n$ and a discriminator $D : \mathbb{R}^n \to \mathbb{R}$.
The key problem is that, typically, without further tuning, the output distribution of each of the random variables $G_i : \Lambda \to \mathbb{R}$ is approximately normal. This is related to the mode-collapse problem [32], a well-reported behaviour of GANs in which, when they try to generate non-normal variables, the output data tend to lack variety, and the generator degenerates to synthesizing only small variations of a prototypical example of $X$. In some sense, the network $G$ gets stuck in a particular example of $X$ that cheats $D$ very efficiently and ignores any other type of data to generate. This is optimal from the point of view of the GAN game (2), but leads to a degenerate behaviour in which the real distribution of $X$ is not recovered.
To address this problem, in this work we propose to facilitate the job of the generator $G$ by using as activation function a customized function able to capture the statistical subtleties of $X$. To be precise, let us denote by $F_N$ the distribution function of the standard normal distribution, that is,
$$F_N(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$$
Additionally, using the sample $x_1 = (x_1^1, \ldots, x_1^n), \ldots, x_m = (x_m^1, \ldots, x_m^n) \in \mathbb{R}^n$ of the $n$-dimensional random vector $X = (X_1, \ldots, X_n)$, we generate the $n$ marginal empirical cumulative distribution functions $\hat{F}_{X_1}, \ldots, \hat{F}_{X_n}$. With this information, we create a new activation function as the juxtaposition of the Smirnov transformations from the standard normal distribution to each $X_i$,
$$\hat{S}_X = \left(\hat{S}_{N \to X_1}, \ldots, \hat{S}_{N \to X_n}\right) = \left(\hat{F}_{X_1}^{-1} \circ F_N, \ldots, \hat{F}_{X_n}^{-1} \circ F_N\right).$$
To speed up the evaluation of the function $\hat{S}_X$, as well as to avoid the vanishing gradient problems derived from the piecewise constant nature of the functions $\hat{F}_{X_i}^{-1}$, instead of using $\hat{S}_X$ we shall interpolate each of the component functions $\hat{S}_{N \to X_i}$ through a numerical interpolation method (typically, spline interpolation [6]) to get approximate functions $\tilde{S}_{N \to X_i}$. Now, the interpolated global function is given by
$$\tilde{S}_X = \left(\tilde{S}_{N \to X_1}, \ldots, \tilde{S}_{N \to X_n}\right).$$
In this way, as the activation function of the output layer of the generator network $G$ we shall use the function $\tilde{S}_X$. The gradients of the error are propagated through this function towards the initial layers of the network, as in the usual backpropagation algorithm. Notice that, provided that the interpolation method returns $C^1$ functions $\tilde{S}_{N \to X_i}$ (i.e. differentiable with continuous derivatives), the activation function $\tilde{S}_X$ is a differentiable map. This is the case, for instance, of spline interpolation, which allows us to deal with discrete distributions even though their underlying quantile function is not smooth.
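The effect of the interpolation step on the gradient can be sketched for a single output unit. Note the substitution: for brevity this example uses piecewise-linear interpolation, which is only continuous, instead of the spline interpolation mentioned in the text, so it illustrates the idea rather than the exact activation used in this work. The grid range, the number of knots, the exponential training column and all function names are hypothetical.

```python
import bisect
import math
import random
from statistics import NormalDist

def make_step_activation(samples):
    """Empirical Smirnov transform from N(0, 1): a piecewise constant function of z."""
    xs = sorted(samples)
    m = len(xs)
    cdf = NormalDist(0.0, 1.0).cdf
    def s_hat(z):
        rank = min(max(math.ceil(cdf(z) * m), 1), m)  # 1-based order statistic
        return xs[rank - 1]
    return s_hat

def make_interpolated_activation(s_hat, lo=-4.0, hi=4.0, knots=33):
    """Piecewise-linear interpolant of s_hat on an even grid; returns (value, slope)."""
    step = (hi - lo) / (knots - 1)
    grid = [lo + i * step for i in range(knots)]
    vals = [s_hat(z) for z in grid]
    def s_tilde(z):
        if z <= grid[0]:
            return vals[0], 0.0       # constant extension to the left of the grid
        if z >= grid[-1]:
            return vals[-1], 0.0      # constant extension to the right of the grid
        i = bisect.bisect_right(grid, z) - 1
        slope = (vals[i + 1] - vals[i]) / step
        return vals[i] + slope * (z - grid[i]), slope
    return s_tilde

rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(5_000)]  # hypothetical training column
s_hat = make_step_activation(samples)
s_tilde = make_interpolated_activation(s_hat)
value, slope = s_tilde(0.1)
```

A tiny perturbation of the input leaves `s_hat` unchanged, so its derivative is zero almost everywhere, whereas `slope` is strictly positive and can carry a gradient back to the earlier layers. A spline would additionally make the slope itself continuous across knots.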
As a final comment, notice that this Smirnov transformation converts random vectors with normal marginal distributions into random vectors whose marginal distributions are approximately those of the $X_i$. However, the global dependence between the different output variables is not captured by $\tilde{S}_X$. Far from being a problem (which may be addressed, for example, with copulas [20]), this is an advantage of this approach: with the use of $\tilde{S}_X$ as activation function, we ease the job of the generator network in generating the marginal distributions, so that it can focus on capturing the non-linear interrelations among the different components, which is typically the hardest part to generate.

Performance metrics
We propose to evaluate GAN performance using two different types of metrics. The first set of metrics is inspired by the $L^1$ functional distance and the Jaccard coefficient [31] and aims to quantify the similarity of the synthetic data with respect to the real data from a statistical perspective, considering the joint distribution of data features. The second set of metrics attempts to quantify the performance of synthetic data when they are used as a substitute for real data in the training of an ML classifier that aims to distinguish among the different types of data contained in the data set. To apply this metric, we must assume that the real data are labelled and that the data set contains more than one type of data.
These two types of metrics will be used to compare the similarity between real and synthetic distributions, and the latter set will also be applied to implement a stopping criterion for GAN training that will allow us to select the best generators producing high-quality synthetic data. It is worth noting that in preliminary experiments we compared the aforementioned metrics with the standard Jensen-Shannon and Wasserstein distributional distances. We finally decided to include only the former in our experiments, since the latter sometimes exhibited oscillatory behaviours that were not present in the former.

$L^1$ distance and Jaccard index
These two metrics try to measure the difference between the probabilistic distributions of real and synthetic data. They are based on two well-known statistical coefficients applied for hypothesis testing and probabilistic distances: the $L^1$ metric and the standard Jaccard coefficient.
Although both metrics use the probability density function of the two data distributions to compute the distance, they can be straightforwardly extended to a more practical scenario where the density functions of the data distributions are not known. Instead, we shall compute an empirical estimator through the histogram to replace the probability density function.
In the following, we shall sketch briefly the main ideas involved in the construction and estimation of these quality metrics. For further details, please refer to [17].
Empirical probability density function. Let us suppose that we have samples $x_1, \ldots, x_n \in \mathbb{R}^d$ of a $d$-dimensional random vector $X$. Let us choose a partition of the support of $X$ into disjoint cubes $C_1, \ldots, C_s$. For simplicity, we shall take all the cubes $C_i$ of the same volume. The empirical probability density function $h_X$ : where $\chi_{C_j}$ is the characteristic function of the cube $C_j$ (i.e. $\chi_{C_j}(x) = 1$ if $x \in C_j$ and is 0 otherwise) and $|\{x_i \in C_j\}|$ stands for the number of samples that belong to the cube $C_j$. By the Glivenko-Cantelli theorem [33], the empirical probability density function $h_X$ is a faithful estimator of the actual probability density function of $X$.
$L^1$ distance. Given two continuous $d$-dimensional random vectors $X$ and $Y$ with probability density functions $f_X$ and $f_Y$, we can consider the $L^1$ distance between their probability density functions, that is,
$$\|f_X - f_Y\|_{L^1} = \int_{\mathbb{R}^d} |f_X(x) - f_Y(x)|\, dx.$$
However, in applications, it is not common to explicitly know the probability density functions of $X$ and $Y$. Instead, from a collection of samples $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ of $X$ and $Y$, respectively, we can compute their empirical probability density functions $h_X$ and $h_Y$. In this way, the empirical $L^1$ distance can be taken as
$$d_{\mathrm{emp}}(X, Y) = L \sum_{j=1}^{s} |h_X(C_j) - h_Y(C_j)|,$$
where $L$ is a constant depending only on the volume and number of elements of the partitions taken, and $h_X(C_j)$ (resp. $h_Y(C_j)$) is the value of the function $h_X$ (resp. $h_Y$) on $C_j$. In this empirical setting, we have that $d_{\mathrm{emp}}(X, Y) = 0$ if and only if the number of samples of $X$ on each cube $C_j$ equals the number of samples of $Y$.
Jaccard index. This metric is designed to compare the supports of two distributions. In this way, instead of looking at the particular distribution function, the aim of this metric is to determine whether the two random variables satisfy the same value constraints.
Suppose that we have two random variables $X$ and $Y$ with supports $\mathrm{supp}(X)$ and $\mathrm{supp}(Y)$, respectively. The Jaccard index of $X$ and $Y$ is
$$J(X, Y) = \frac{|\mathrm{supp}(X) \cap \mathrm{supp}(Y)|}{|\mathrm{supp}(X) \cup \mathrm{supp}(Y)|},$$
where $|A|$ stands for the Lebesgue measure (i.e. the volume) of the measurable set $A$. This coefficient takes values in the interval $[0, 1]$, and the larger the value of $J(X, Y)$, the more similar the supports.
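Again, if the real support is not known, we can still estimate it through the empirical probability density functions as
$$J_{\mathrm{emp}}(X, Y) = \frac{|\mathrm{supp}(h_X) \cap \mathrm{supp}(h_Y)|}{|\mathrm{supp}(h_X) \cup \mathrm{supp}(h_Y)|}.$$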
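A minimal Python sketch of both empirical metrics follows. It rests on several assumptions that are not fixed by the text: the cubes are axis-aligned with a common side `width`, the histogram is normalized by the number of samples only, and the constant $L$ is taken as 1. Since all cubes have the same volume, the ratio of Lebesgue measures in the Jaccard index reduces to a ratio of numbers of occupied cubes. All function names are hypothetical.

```python
import math
from collections import Counter

def histogram(samples, width):
    """Map each occupied cube (identified by its integer grid index) to its fraction of samples."""
    counts = Counter(tuple(math.floor(c / width) for c in x) for x in samples)
    n = len(samples)
    return {cube: k / n for cube, k in counts.items()}

def empirical_l1(hx, hy):
    """Sum over cubes of |h_X(C_j) - h_Y(C_j)|, i.e. the empirical distance with L = 1."""
    cubes = set(hx) | set(hy)
    return sum(abs(hx.get(c, 0.0) - hy.get(c, 0.0)) for c in cubes)

def empirical_jaccard(hx, hy):
    """Cubes occupied by both samples divided by cubes occupied by at least one of them."""
    sx, sy = set(hx), set(hy)
    return len(sx & sy) / len(sx | sy)

# Tiny two-dimensional example with unit cubes.
real = [(0.1, 0.1), (0.2, 0.3), (1.5, 0.2), (1.7, 0.9)]
fake = [(0.4, 0.6), (1.1, 0.1), (2.5, 2.5), (2.6, 2.7)]
h_real, h_fake = histogram(real, 1.0), histogram(fake, 1.0)
l1 = empirical_l1(h_real, h_fake)
jaccard = empirical_jaccard(h_real, h_fake)
```

With this normalization the distance ranges from 0 (identical histograms) to 2 (disjoint supports), while the Jaccard index ranges from 0 to 1.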

Nested ML performance
The second set of metrics attempts to quantify the performance of synthetic data when it is used as a substitute for real data for training a ML classifier.
To be precise, suppose that our data set of real data, let us call it DS, is labelled for a supervised classification ML task. In other words, DS contains instances of s ≥ 2 different classes which are appropriately labelled. For the sake of notational simplicity, we shall consider the case s = 2 of binary classification (as appears in the experiments of this work), but the approach can be straightforwardly generalized.
In order to train an ML model, as usual, we split DS into two data sets, DS1 and DS2, with similar statistical properties. DS1 is used for training an ML classifier, whereas DS2 is reserved for testing its accuracy through the standard classification quality measures: $F_1$-score, precision and recall.
Nevertheless, apart from training the ML classifier, DS1 can also be used to train GANs aiming to replicate its data. Hence, using DS1 we train two GANs $(\Lambda_0, G_0, D_0)$ and $(\Lambda_1, G_1, D_1)$ to synthesize data with label 0 and 1, respectively. Choose $N, M > 0$ and draw samples $x^0_1, \ldots, x^0_N$ and $x^1_1, \ldots, x^1_M$ of the latent spaces $\Lambda_0$ and $\Lambda_1$, respectively. Then, using the generators $G_0$ and $G_1$, we create a new fully synthetic training data set with $N + M$ instances, a synthetic counterpart of DS1, by joining the synthetic data generated by both GANs.
With this new synthetic data set, we train a standard ML classifier (say, a random forest classifier [13]). Then, by monitoring the precision, recall, and $F_1$-score of the classifier against DS2, we are able to measure the quality of the generated data: the higher these measures, the better the synthetic data generated by $(\Lambda_0, G_0, D_0)$ and $(\Lambda_1, G_1, D_1)$. Hence, large values of these coefficients indicate that the synthetic data generated by $G_0$ and $G_1$ can faithfully substitute the real instances. Observe that no real data are used for training, although real data are always used for testing.
Additionally, as a baseline for the metrics obtained with GAN synthetic data, we can also consider the ML classifier trained with the real data set DS1 and compute its performance metrics with DS2 as the testing data set. In this way, we can compare the performance of the ML classifier trained with GAN synthetic data against the benchmark-level performance obtained using real data during the training of the ML classifier. Notice that our approach differs substantially from many existing works that only mix real with synthetic data (e.g., data augmentation solutions), which can lead to data privacy breaches since real data are present in the resulting data set.
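The train-on-synthetic, test-on-real protocol can be sketched as follows. Everything in this example is a hypothetical stand-in: the one-dimensional Gaussian samples play the role of the outputs of $G_0$ and $G_1$ and of the real test set DS2, and a nearest-centroid rule replaces the random forest classifier purely to keep the sketch dependency-free.

```python
import random
from statistics import mean

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

rng = random.Random(0)
# Stand-ins for the synthetic data of each label (training) and for the real test set.
synthetic_0 = [rng.gauss(0.0, 1.0) for _ in range(2_000)]
synthetic_1 = [rng.gauss(3.0, 1.0) for _ in range(2_000)]
real_test = ([(rng.gauss(0.0, 1.0), 0) for _ in range(1_000)]
             + [(rng.gauss(3.0, 1.0), 1) for _ in range(1_000)])

# Classifier fitted only on synthetic data: assign each point to the nearest class centroid.
c0, c1 = mean(synthetic_0), mean(synthetic_1)

def predict(x):
    return 0 if abs(x - c0) <= abs(x - c1) else 1

y_true = [label for _, label in real_test]
y_pred = [predict(x) for x, _ in real_test]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

The three scores are computed exclusively on real data, so a drop with respect to a classifier fitted on real training data would point to a loss of fidelity in the synthetic samples.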
Marginal quality evaluation. From a practical perspective, beyond the aforementioned process, we can also evaluate the marginal quality of each of the generators before jointly evaluating the quality of $G_0$ and $G_1$. We generate only one of the types of data, say label 0, and we mix the synthetic samples of label 0 with real samples of label 1 obtained from DS1. This data set is used to train an ML classifier, and then the classifier is tested on DS2 to get the performance measures. This process is repeated for each type of data. In this way, the corresponding ML accuracy coefficients will only measure the ability of $G_0$ to generate label 0, regardless of the fitness of $G_1$, and vice versa.
In our experiments, we apply a variant of this approach that computes the performance of the ML classifier at each of the training epochs of the GAN. In this way, we are able to monitor the evolution of the training and to relate it to the quality of the generated data. In particular, this idea enables a novel stopping criterion: when the GAN training epochs no longer produce any significant enhancement in the performance of the ML classifier, the training process of the GAN is stopped. It is worth noting that this approach allows each type of GAN to be trained in parallel, and therefore each training can be stopped at a different epoch when no significant enhancement is observed in that particular GAN.
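One possible way to turn this stopping criterion into code is a patience rule over the per-epoch $F_1$-scores of the nested classifier. The text does not specify how "no significant enhancement" is quantified, so the parameters `patience` and `min_delta`, as well as the function name, are assumptions made for this sketch.

```python
def stopping_epoch(f1_per_epoch, patience=5, min_delta=0.005):
    """Return (stop_epoch, best_epoch) for a sequence of nested-classifier F1-scores.

    Training stops once `patience` epochs have passed without improving the best
    score by more than `min_delta`; otherwise it runs until the last epoch.
    """
    if not f1_per_epoch:
        raise ValueError("at least one epoch is required")
    best, best_epoch = float("-inf"), 0
    for epoch, f1 in enumerate(f1_per_epoch):
        if f1 > best + min_delta:
            best, best_epoch = f1, epoch      # significant enhancement: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch          # no significant enhancement for too long
    return len(f1_per_epoch) - 1, best_epoch

# Hypothetical F1 history of the nested classifier, one value per GAN epoch.
history = [0.50, 0.60, 0.70, 0.71, 0.705, 0.70, 0.71, 0.70, 0.69, 0.90]
stop, best = stopping_epoch(history, patience=3, min_delta=0.02)
```

Here the training would stop at epoch 5 and the generator saved at epoch 2 would be kept. Because each GAN has its own history, the rule can be applied independently to the generators of each label.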
Choice of the best GAN model. After each GAN is trained, the joint performance of both types of synthetic data is computed. Using the whole set of GANs obtained during the marginal quality evaluation would require computing the ML performance for each pair $(G_0, G_1)$ of generators $G_0$ and $G_1$ at each of their training epochs. This leads to a quadratic number of generator pairs to be evaluated in the ML task, both for training and testing, to obtain the full set of measurements.
However, we observed experimentally that drawing roughly a dozen samples by choosing uniformly at random one generator of each type of data tends to produce results equivalent to the brute-force approach of trying all possible combinations. In addition, we applied more elaborate strategies based on performance elitism, ordering the generators of each type of traffic by $F_1$-score and choosing generators at random only from the subset containing the best generators.
Finally, we would like to remark that, since we have generative models able to create as many samples as needed, we can choose the number of generated samples $N$ and $M$ as large as desired. If we choose $N$ and $M$ in the same range as the number of instances in the original data set DS1, we obtain a synthetic data set with very similar characteristics to the original one. In particular, any imbalance between classes will remain. However, other choices can be made. For instance, we can decide to take $N = M$, so that the obtained data set corrects the imbalance, or to take $N$ and $M$ much larger than the size of the original data set, so that we increase the amount of data available for the ML classifier. It is worth mentioning that, even though this solution gives rise to a balanced data set as with data augmentation procedures, the proposed solution is substantially stronger than simple data augmentation: the synthetic data are not a simple enrichment of the original data set but a completely new data set.
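The random and elitist selection strategies can be sketched as follows. The function name, the `top_k` parameter that defines the elite subset and the toy score lists are hypothetical; the default of twelve pairs mirrors the "roughly a dozen" samples mentioned above.

```python
import random

def sample_generator_pairs(f1_label0, f1_label1, n_pairs=12, top_k=None, seed=0):
    """Draw (epoch_0, epoch_1) pairs of generator checkpoints to evaluate jointly.

    f1_label0[e] and f1_label1[e] are the marginal F1-scores of the generators of
    each label at epoch e. With top_k=None, epochs are drawn uniformly at random;
    otherwise only the top_k best-scoring epochs of each label are eligible.
    """
    rng = random.Random(seed)

    def candidates(scores):
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        return ranked if top_k is None else ranked[:top_k]

    c0, c1 = candidates(f1_label0), candidates(f1_label1)
    return [(rng.choice(c0), rng.choice(c1)) for _ in range(n_pairs)]

# Hypothetical marginal scores for four training epochs of each GAN.
f1_0 = [0.10, 0.90, 0.80, 0.30]
f1_1 = [0.70, 0.20, 0.95, 0.40]
uniform_pairs = sample_generator_pairs(f1_0, f1_1)
elite_pairs = sample_generator_pairs(f1_0, f1_1, top_k=2)
```

Each returned pair would then be used to build a fully synthetic training set and scored with the nested classifier, avoiding the quadratic cost of evaluating every combination of epochs.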

Dataset description
To demonstrate the applicability of the proposed solution, we have defined two different use cases to experimentally validate the versatility of ST-based GANs. The general objective of the two use cases is to test whether GAN-generated synthetic traffic can fully replace real traffic in ML problems where the use of real data could generate privacy breaches.
First use case: Rendered data set. For the first use case considered in this work, we have designed two fictitious data sets (denoted by DS1-r and DS2-r). As previously commented in subsection 4.2, we train GANs with DS1-r, while DS2-r is used for testing the ML classifier that will assess the quality of the synthetic data generated by the GANs. In addition, DS1-r is used for training the benchmark ML classifier. Both data sets contain entries of two different data types (i.e., 2 labels) and are perfectly balanced, containing 400,000 entries for each label. The data sets are composed of four variables, two of which contain continuous values and the other two discrete values. Each of the entries of the data set was generated by sampling from two random vectors (one per label class) made of independent random variables with the distributions shown in Table 1 (see also their histograms in Figures 7 and 8).
This use case aims to demonstrate that the use of linear activation functions at the last layer of the generator fails to replicate complex data distributions such as the ones we have rendered and, in particular, the variables representing discrete data distributions. On the contrary, we show that, using ST-based activation functions, we are able to replicate both continuous and discrete variables with high statistical fidelity, even if the data variables follow complex statistical distributions.
It is worth noting that the statistical distributions of the two types of data have been generated in such a way that they are similar on average, which makes the task of an ML classifier more complicated when we want to train it to correctly identify the two types of data. Indeed, if the synthetic data have not been generated with sufficient fidelity to the two real data distributions, because the means of their 4 variables are so close, the ML classifier trained with synthetic data will obtain a significantly worse performance in terms of accuracy, precision and recall than a benchmark classifier trained with real data.
Finally, observe that some of the distributions of the 8 variables (4 per data type) have been generated with statistical patterns different from the Gaussian distribution (Table 1) to demonstrate that the generators with linear activation do not replicate with precision the real distributions when they are not Gaussian or discrete, and that on the contrary, when the generators have activation functions based on ST, the distributions of the synthetic data closely replicate the real variables even if their distributions follow statistical patterns very different from Gaussian distributions (e.g., discrete distributions).
Second use case: Network traffic. The second use case aims to evaluate the replication by GANs of data coming from a real scenario in the cybersecurity domain. The real data used in this experiment were previously generated in a realistic network laboratory called the Mouseworld lab [24]. The Mouseworld is a network digital twin created at Telefónica R+D facilities that allows deploying complex network scenarios in a controlled way. In this lab, a set of virtual machines was deployed for the generation of regular network traffic (e.g., web and video flows) jointly with cryptomining clients connected to public mining pools on the Internet [25].
We ran the experiment twice for one hour, collected the transmitted packets, and generated two data sets (denoted by DS1-c and DS2-c), each with 4 million flow-based entries containing statistics of the connections. Normal traffic connections were labelled with 0 and cryptomining ones with 1. It is worth noting that both data sets are highly unbalanced, containing only 4,000 entries of cryptomining connections.
A set of 59 statistical features was extracted from each TCP connection, although we selected a reduced set of 4 for our experiments: (a) number of bytes sent from the client, (b) average round-trip time observed from the server, (c) outbound bytes per packet, and (d) ratio of inbound packets to outbound packets. These four features were selected as they exhibit two interesting properties for our generative experiments that were previously commented in the first use case: (i) each feature presents a different statistical behaviour far from a Gaussian distribution and (ii) the means of each feature in the two types of traffic (normal and cryptomining) are close, which makes the task of an ML classifier more complicated when we want to train it to correctly identify the two types of data.
The nature of both types of traffic is very different, a fact that will be reflected in the quality of the GAN-generated data. The normal traffic has a great variety since it is composed of many types of connections (e.g., video, audio, web elements, and multimedia elements). On the contrary, the cryptomining connections are handled by a reduced set of protocols and therefore, their statistical patterns are not expected to differ substantially. Due to the greater diversity that normal traffic connections exhibit when compared to cryptomining connections, GANs that try to replicate label-0 data will perform slightly worse than their counterparts that replicate label-1 data.

Proposed architecture
Aiming to generate synthetic data for several types of behaviour, we adopted in preliminary experiments a well-known conditional GAN model, the so-called Auxiliary Classifier GAN (AC-GAN) [21], as the architecture to generate all types of variables at the same time. In both use cases, the AC-GAN did not perform adequately when replicating the two types of data and, moreover, it generated significant oscillations in the convergence process. For that reason, we opted to use a different GAN for each type of data to be generated.
To avoid the mode collapse problems that frequently appear during GAN training, we adopted as a reference model the WGAN architecture [2], in which a Wasserstein loss replaces the standard cross-entropy loss. We tested two different strategies to enforce the required Lipschitz constraint in the cost function, weight clipping [2] and gradient penalty [10], and did not observe any significant improvement in the convergence of the GAN training. Therefore, we chose a WGAN architecture with no additional strategy to enforce the Lipschitz constraint and with small learning rates for the discriminator as a heuristic to avoid mode collapse situations.
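For reference, the Wasserstein loss in its usual formulation can be sketched as follows. This is a minimal illustration on plain lists of critic scores, not the training code used in the experiments; the function names and the toy scores are assumptions.

```python
from statistics import fmean

def critic_loss(scores_real, scores_fake):
    """Wasserstein critic loss: minimising it widens the gap real - fake."""
    return fmean(scores_fake) - fmean(scores_real)

def generator_loss(scores_fake):
    """Generator loss: minimising it raises the critic score of synthetic data."""
    return -fmean(scores_fake)

# Toy critic outputs (made-up values) for a mini-batch of real and synthetic rows.
real_scores = [0.9, 1.1, 1.0]
fake_scores = [0.2, 0.4, 0.3]

d_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
```

Unlike the cross-entropy loss, these scores are unbounded real numbers rather than probabilities, which is why a Lipschitz constraint on the critic is normally discussed alongside this loss.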
Given that the 4 features to be synthetically replicated in both use cases did not exhibit any topological structure or time relationship among them, convolutional or recurrent networks would not provide any advantage. Therefore, we selected fully connected neural networks (FCNNs) as the architectural model for both the discriminator and generator networks. LeakyReLU functions were used as the activation in all layers, except for the last layer of the generator. Based on previous experiments [17], no filtering based on the output of the discriminator was applied to discard synthetic data, nor was noise added to synthetic or real data during the training of the discriminator to help GAN convergence.

Experimental results
In this section, we analyze the performance on each of the use cases of Section 5.1 of four different generative approaches: (i) when real data is used and no generation occurs, (ii) with a simple mean-based generator, (iii) with a standard WGAN, and (iv) with a WGAN with ST-based activation function. As we will show, the ST-based solution outperforms both the standard GAN and the simple mean-based generator, reaching an accuracy in a nested ML classifier similar to the one obtained with real data.
As the ML classifier, we selected a Random Forest model with 300 estimators. In the first use case, we limited the depth of each tree to 20 to avoid some overfitting effects that appeared in preliminary experiments. No depth limit was applied to trees in the second use case. In the first use case, a balanced set of samples was obtained for each label (200,000 samples per label, totalling 400,000 samples) for training and testing. In the second use case, we kept the original ratio of the two labels (many more normal traffic connections than cryptomining ones) and obtained 400,000 samples of label 0 (normal traffic) and only 4,000 of label 1 (cryptomining connections) for training and testing. In the testing process, we established decision thresholds of 0.2, 0.4, 0.6 and 0.8 in addition to the default 0.5 in order to analyse the results of both the default and the best performing threshold. Finally, training and testing of ML classifiers were run 100 times in all experiments to minimise biased behaviours during sampling and training.
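The threshold analysis can be sketched as below. The sketch assumes that the classifier outputs a probability for label 1 and that the F1-score is computed for label 1 as the positive class; the function names and toy data are hypothetical.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1-score of the positive class when predicting 1 iff prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Default threshold (0.5) plus the four additional ones from the text.
THRESHOLDS = (0.2, 0.4, 0.5, 0.6, 0.8)

def best_threshold(probs, labels):
    """Return the best-scoring threshold and the score of every threshold."""
    scores = {t: f1_at_threshold(probs, labels, t) for t in THRESHOLDS}
    return max(scores, key=scores.get), scores

# Toy predicted probabilities and true labels (made-up values).
probs = [0.1, 0.3, 0.45, 0.7, 0.9]
labels = [0, 0, 1, 1, 1]
best, scores = best_threshold(probs, labels)
```

Reporting both scores[0.5] and scores[best] mirrors the pair of results (default and best threshold) summarised in the tables.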
For training WGANs, we used previous knowledge from past experiments [17] and applied the set of hyperparameters detailed in Table 2, performing a blind random search in the hyperparameter space guided by the F1-score obtained in a nested ML model that evaluated the marginal quality of the generator every 10 training mini-batches (see Section 4.2). For each type of data, the WGAN selected was the one that obtained the best F1-score for the nested classifier in any of its mini-batches. As optimization algorithms, we used Adam for generators and RMSProp for discriminators, and the binary cross-entropy loss function was replaced by the Wasserstein loss. The hyperparameters chosen for the generator and discriminator in each use case are detailed in the last two columns of Table 2.

Real data
To establish an upper bound on the expected performance of the nested ML classifier, a benchmark classifier was trained 100 times for each of the two use cases with real samples from the first data set DS1 and tested using samples from the second data set DS2. The first row in Table 3 summarises the obtained F1-score values and confusion matrices in testing for the best decision threshold and the default threshold (0.5) in the first use case (rendered data). The first row of Table 4 summarises the same information for the second use case (cryptomining attack).
In addition, Figure 13a plots the histogram of the F1-score results obtained in the first use case when an ML classifier was trained with DS1-r and tested with DS2-r. Similarly, Figure 15a shows the histogram of the F1-score results obtained in the second use case when an ML classifier was trained with DS1-c and tested with DS2-c.

Mean generator
In contrast to the upper bound above, we established a baseline in our experiments through a naïve generator designed to generate new data by adding Gaussian noise to the means of the data features. The standard deviation of the noise was manually adjusted to produce the best results and hence, a more challenging baseline. Each naïve model was tested with the corresponding DS2 data set of each use case.
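A minimal sketch of such a mean-based generator is given below. It assumes a single, manually chosen noise standard deviation shared by all features (the text does not specify whether it was tuned per feature); the function names are hypothetical.

```python
import random
from statistics import fmean

def fit_mean_generator(rows):
    """Return the per-feature means of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]

def sample_mean_generator(means, sigma, n, seed=0):
    """Generate n rows as the feature means plus zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [[mu + rng.gauss(0.0, sigma) for mu in means] for _ in range(n)]

# Toy "real" data with two features (made-up values).
real_rows = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
means = fit_mean_generator(real_rows)
synthetic = sample_mean_generator(means, sigma=0.1, n=5)
```

By construction this baseline reproduces only the first moment of each feature, which is why it is a useful lower reference when the class means are close.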
A summary of the baseline results obtained with this naïve model can be found in the second rows of Table 3 and Table 4, for the first and second use cases respectively. Histograms showing the F1-score values obtained after running the mean generators 100 times are shown for each use case in Figure 13b and Figure 15b.

GANs: Linear and custom activation
For the two use cases of Section 5.1, we ran a set of experiments to obtain the performance of an ML classifier trained with GAN synthetic data and tested with DS2 data sets (DS2-r for the first use case and DS2-c for the second). GANs were always trained using DS1 data sets (DS1-r for the first use case and DS1-c for the second). The GANs were trained for a fixed number of 25,000 mini-batches (50 epochs). Every 10 mini-batches, the GAN generator model was saved for later use, and the L1 and Jaccard distances between synthetic and real data were computed. In addition, as mentioned in Section 4.2, a training data set was generated mixing GAN synthetic data of the current label with real data sampled from the other label of DS1. Using this hybrid data set, we trained an ML classifier, and then the ML model was tested with DS2. The obtained F1-score represents the marginal performance of the GAN synthetic data for the current label and, in addition, provides a potential early stopping criterion for GAN training. Note that although both DS1 and DS2 contain real data, the performance against DS2 provides a more reliable measure, since DS1 was used for training the GAN and therefore, the GAN generator could have learnt specific information only contained in DS1.
After training standard and ST-based WGANs for the two labels, we ran the following experiment for each type of GAN (standard and ST-based) to highlight the advantage of the proposed ST solution. For each label, a WGAN generator is selected uniformly at random among all models stored previously every 10 mini-batches. Then, a completely synthetic data set is produced using the generator of label 0 and the generator of label 1. Using this synthetic data set, we train the ML classifier and test its performance with DS2, obtaining the F1-score value and the confusion matrix. This process was run 100 times to compare the statistical distribution of the obtained F1-score values for the standard and ST-based WGANs. In addition, we repeated the experiment selecting each label generator uniformly at random not among the whole set of stored generators for a label but among the top 10 sorted by the marginal F1-score for this label (i.e., using F1-score elitism). In this way, we explored whether it is more efficient to search for the best performing label 0 and label 1 generators among all stored generators or using the F1-score elitism. We also analysed whether the ST-based solution performs better than the standard WGAN using this elitism.
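The two selection strategies can be sketched as follows, assuming each stored generator is represented by a (checkpoint id, marginal F1-score) pair; the function name and the made-up scores are hypothetical.

```python
import random

def pick_generator(checkpoints, rng, top_k=None):
    """Pick one stored generator uniformly at random.

    checkpoints: list of (checkpoint_id, marginal_f1) pairs.
    top_k: if given, restrict the draw to the top_k checkpoints
           with the highest marginal F1-score (F1-score elitism).
    """
    pool = checkpoints
    if top_k is not None:
        pool = sorted(checkpoints, key=lambda c: c[1], reverse=True)[:top_k]
    return rng.choice(pool)

rng = random.Random(0)
# Hypothetical checkpoints saved every 10 mini-batches, with made-up scores.
ckpts = [(i * 10, rng.random()) for i in range(1, 101)]

uniform_pick = pick_generator(ckpts, rng)           # whole set of generators
elite_pick = pick_generator(ckpts, rng, top_k=10)   # F1-score elitism
```

In the experiment this draw would be made once per label and repeated 100 times, each time training and testing the nested classifier on the resulting synthetic data set.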
Distances from synthetic to real data. Figure 1 shows the evolution of the L1 distance and the Jaccard index during the GAN training for label 0 and label 1 in the first use case. It can be observed for both labels that the L1 distance curve for the ST-based WGAN stabilises faster, shows fewer oscillations and achieves smaller values (around 0.3 for both labels when GAN training is stabilised) than in the standard WGAN with linear activation (around 0.6 for both labels at the minimum points of the curves). With respect to the Jaccard index, the results of the ST-based WGAN lead to a similar conclusion: the curves for both labels achieve high values (around 0.7 for label 0 and 0.9 for label 1), stabilise faster (in 4 epochs for label 0 and 6 epochs for label 1) and do not exhibit significant oscillations. On the contrary, the Jaccard curves of the standard WGAN for the two labels show poor performance, with values not greater than 0.4. These results highlight that in the first use case the similarity between the synthetic data generated by the ST-based WGAN generator and the real data is much higher than when the synthetic data is generated by the standard WGAN generator.
The L1 distance and Jaccard index results for the second use case are shown in Figure 2. Similarly to the first use case, the distance in the ST-based WGAN stabilises quickly (10 epochs for label 0 and 20 for label 1), with a very low value (around 0.25) and without significant oscillations, indicating that the quality of the generated synthetic data is high. On the contrary, the standard WGAN suffers from remarkable oscillations and the distance value is not small (from 0.65 to 1.6 for label 0 and from 0.6 to 1.25 for label 1), which highlights that the similarity between the synthetic data generated by the GAN and the real data is not very high. Analyzing the Jaccard coefficient in the figure for the ST-based WGAN curve, it can be seen that values around 0.4 and 0.6 are obtained for labels 0 and 1, respectively. The figure suggests that with more training epochs these values would continue to grow. In contrast, the standard WGAN curve quickly stabilises around a very small value of 0.1 for both labels, which shows that in this case, the statistical distributions of real and synthetic data are quite different.
We can conclude that in both use cases, the ST-based WGAN replicates for the two labels the statistical behaviour of the real data (DS1 data set) with high precision and requiring only a few training epochs. In addition, the quality of the synthetic data produced throughout the training process does not suffer from significant oscillations. On the contrary, the standard WGAN replicates with worse quality the statistical distribution of the real data, needs more training epochs, and the quality of the generated data suffers from high oscillations during the training process, which prevents its use in real applications.
Evolution of synthetic data quality. In this section, we analyze the evolution of the quality of the synthetic data generated by the standard and ST-based WGANs with respect to the real data. Figures 3 and 4 for the first use case, and Figures 5 and 6 for the second, show graphically the obtained distributions. To this end, we compare and plot samples of real and synthetic data distributions at different mini-batches (1, 100, and 1000) corresponding to the epochs 0, 2 and 20 respectively. To ease the visualization of the above-mentioned figures, we have flattened the four histograms corresponding to the four features of the data set into a 1-dimensional plot. To be precise, each of the points in the x-axis of the plot represents a cube (a bin) in which the empirical probability density function was computed (see Section 4.1). In this way, comparing the cubes of two samples, we can infer whether the two data distributions are similar or not. For example, samples of two data distributions with elements placed in different cubes would point to data distributions with significant differences. On the contrary, if the elements of both distributions are mapped to the same cubes and the number of elements in each cube is similar for both samples, we could infer that the two data distributions are similar.
To plot and compare the synthetic and real data distributions, the cubes of the real data sample are sorted by the number of elements in each cube in ascending order. In this way, the marks on the x-axis represent the cubes as ordered for the real data sample. Therefore, the real data curve always exhibits an ascending shape. The y-axis represents the number of sample elements that are placed in each cube. It is worth noting that higher numbers on the x-axis indicate that the WGAN has created many nonexistent elements (in the real data set) that are assigned to new bins. These bins containing elements outside the real data domain are placed on the left side of the WGAN curve since the real data curve has no elements in such bins.
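The cube-based comparison can be sketched as follows. The exact definitions are those of Section 4.1; here we assume, for illustration only, equal-width cubes, an L1 distance computed as the sum of absolute differences between empirical cube frequencies, and a Jaccard index computed over the sets of occupied cubes. All function names are hypothetical.

```python
from collections import Counter

def bin_counts(rows, width):
    """Count samples per cube; a cube is the tuple of per-feature bin indices."""
    return Counter(tuple(int(x // width) for x in row) for row in rows)

def l1_distance(real, synth):
    """Sum over cubes of |p_real - p_synth| for the empirical frequencies."""
    n_r, n_s = sum(real.values()), sum(synth.values())
    cubes = set(real) | set(synth)
    return sum(abs(real[c] / n_r - synth[c] / n_s) for c in cubes)

def jaccard_index(real, synth):
    """Fraction of occupied cubes that are populated by both samples."""
    a, b = set(real), set(synth)
    return len(a & b) / len(a | b)

def flattened_curves(real, synth):
    """Order cubes by real count (ascending); synthetic-only cubes come first."""
    cubes = sorted(set(real) | set(synth), key=lambda c: real[c])
    return [real[c] for c in cubes], [synth[c] for c in cubes]

# Toy two-feature samples (made-up values).
real = bin_counts([[0.1, 0.2], [0.3, 0.1], [1.5, 0.4], [1.2, 1.7]], width=1.0)
synth = bin_counts([[0.2, 0.2], [1.1, 0.3], [2.5, 2.5], [1.9, 1.1]], width=1.0)

real_curve, synth_curve = flattened_curves(real, synth)
```

In this toy example one synthetic sample falls in a cube that the real data never occupies, so it appears at the left end of the flattened plot, where the real curve is zero.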
In the first use case, it can be observed in Figure 3 that the standard WGAN for label 0 creates many more first use case in Figure 7 (label 0) and Figure 8 (label 1), and for the second use case in Figure 9 (label 0) and Figure 10. In general, it can be observed in all figures that the ST-based WGAN replicates the histogram and the KDE function of each variable with high accuracy. On the contrary, for the two use cases and the two labels, the standard WGAN tends not to mimic the statistical distribution of the real data.
It is worth noting that one of the key innovations of our proposal is the ability of ST-based WGANs to replicate discrete data variables. To the best of our knowledge, none of the existing solutions provides a clean approach to solve this crucial issue as the ST-based approach does. In the first use case, an ad-hoc data distribution was designed to contain two discrete variables (features 1 and 3) in each label to highlight the advantage of our solution in comparison with current solutions. It can be observed in the label 0 histograms for features 1 and 3 (Figure 7b and Figure 7d) that the ST-based WGAN faithfully replicates the discrete nature of these features. In sharp contrast, the standard WGAN fails at this task and generates two synthetic variables that follow a continuous distribution. The same situation appears for label 1: variables 1 and 3 (Figure 8b and Figure 8d) are faithfully replicated by the ST-based WGAN, whereas the standard WGAN generates synthetic variables following a continuous distribution.
As a final remark, note that the discrete variables can be categorical or not (e.g., an ordered sequence of integer numbers) as the ST solution does not impose any assumption on this and therefore, provides a general solution for any discrete variable.
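The reason why a Smirnov-type transformation handles discrete variables naturally can be illustrated with plain inverse transform sampling. The sketch below is not the ST-based activation itself (which is integrated in the generator and used during backpropagation); it only shows, under the assumption of an empirical quantile function built from a real sample, that mapping uniform values through the inverse CDF yields values restricted to the observed support. The names and the toy feature are hypothetical.

```python
import random
from collections import Counter

def empirical_quantile(data):
    """Return the inverse empirical CDF (quantile function) of a 1-D sample."""
    ordered = sorted(data)
    n = len(ordered)

    def inv_cdf(u):
        # Map u in [0, 1) to the order statistic at that rank.
        return ordered[min(int(u * n), n - 1)]

    return inv_cdf

rng = random.Random(0)
# Hypothetical discrete feature taking only the values 0, 1 and 5.
real = rng.choices([0, 1, 5], weights=[0.5, 0.3, 0.2], k=10_000)

inv_cdf = empirical_quantile(real)
synthetic = [inv_cdf(rng.random()) for _ in range(10_000)]

real_freq = Counter(real)
synth_freq = Counter(synthetic)
```

Because the quantile function only returns values present in the real sample, no assumption on whether the variable is categorical or ordered is needed.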
Marginal F1-score. Looking at the marginal F1-score values obtained for each label in the first use case (Figure 11), we observe that the standard WGAN generators for both labels fail completely when they are used to marginally replace real data in training an ML classifier, as they obtain very poor F1-score values (around 0.35 for label 0 and 0.4 for label 1) in both testing (DS2-r) and training (DS1-r) data sets. On the contrary, the ST-based WGAN obtains F1-score values close to the ones obtained using real data in the training of the ML classifier. The benchmark classifier trained with real data obtained an F1-score of 0.812 with DS2-r, and using the synthetic data generated by the ST-based WGAN we obtained around 0.75 for label 0 and 0.7 for label 1 with DS2-r as testing data set. Furthermore, the ST-based WGAN did not generate any significant oscillations in the F1-score curve after the maximum F1-score values were reached at 4 and 8 epochs respectively for labels 0 and 1.
In general, the results of the marginal F1-score for the second use case are aligned with those obtained in the first use case. When the ST-based WGAN was used with DS2-c as the testing data set, higher F1-score values were achieved (around 0.6 for label 0 and 0.85 for label 1) without incurring significant oscillations once GAN training had stabilised (around 8 epochs for both labels). In contrast, the standard WGAN performance was not good, with small F1-score values for label 0 (from 0.1 to 0.5) and noticeable oscillations throughout the GAN training for both labels.
We can conclude that in both use cases, the ST-based WGAN obtained better marginal F1-score values, stabilised to the maximum values faster and without producing significant oscillations once stabilised.
Nested ML evaluation. In this section, we analyze the performance of the nested ML classifier when a fully synthetic data set is generated using GANs. Recall that this data set is created by mixing samples created by the GAN generators of labels 0 and 1, and it is used to feed an ML classifier. As criteria for picking the generators to be used for each label, we applied two different strategies: (i) drawing a random sample among the whole set of stored generators, and (ii) drawing a random sample from the top 10 models sorted by the marginal F1-score of the label (i.e., using F1-score elitism).
Figure 12: Cryptomining attack scenario (second use case). Evolution of the marginal F1-score on training and testing, using GAN generators with linear activation (top row) and IST-based activation (bottom row) for labels 0 and 1. The first and second columns correspond to the evolution for the GAN trained to generate label 0, whereas the third and fourth columns correspond to the generation of label 1. The x-axis represents the GAN training epochs.
Table 3 summarises the results obtained for the first use case and Figure 14 shows detailed histograms of the F1-score results obtained after running each experiment 100 times. In the light of the results presented in these plots, the main conclusions are:
(i) Both the ST-based and standard WGANs can be used for training an ML classifier, as their performance is better than that of the noise generator and only slightly worse than that of the real data.
(ii) The synthetic data generated by the ST-based WGAN performs slightly better than that of the standard WGAN, both when random selection is done on the whole set of models (0.793 against 0.789 for the best F1-score obtained) and when the F1-score elitism is used (0.783 against 0.775 for the best F1-score obtained).
(iii) The interval of the F1-score values obtained when using the ST-based WGAN (from 0.75 to 0.79) was concentrated near the maximum value and was significantly narrower than that obtained with the standard WGAN, whose values are noticeably dispersed over a larger interval from 0.5 to 0.79. Hence, selecting ST-based generator models at random is highly likely to yield a synthetic data set that performs close to the best model combination and the real data. In contrast, if we select generators at random from the standard WGAN, it is less likely to obtain a synthetic data set that comes close to the performance of the real data.
(iv) When the ST-based activation was used, although the maximum value obtained with F1-score elitism was slightly lower than that obtained with the uniform drawing method, the distribution of results was slightly more concentrated near the maximum value.
Table 3: Rendered data set (first use case). Performance of synthetic traffic combining labels 0 and 1. Training with (i) a real data set (R-DS1), (ii) a mean-based noise generator and (iii and iv) GAN synthetic data (with linear and ST activation). Results on testing with real data (R-DS2). Experiment is drawn 100 times uniformly at random (Figure 13 and Figure 14). For the GANs, in each sample we choose one generator from among all label 0 generators and one generator from among all label 1 generators.
Table 4 summarizes the results obtained for the second use case and Figure 16 shows detailed histograms of the F1-score results obtained after running each experiment 100 times. In a general sense, the results are aligned with the ones of the first use case:
(i) ML classifiers trained with synthetic data generated by WGANs obtain a performance similar to that obtained when trained with real data. In fact, the best value of the standard WGAN with F1-score elitism was slightly greater than the best value obtained with real data.
(ii) The interval of the F1-score values obtained when using the ST-based WGAN was concentrated near the maximum value and this effect was accentuated when F1-score elitism was applied. Hence, selecting ST-based generator models at random is highly likely to yield a synthetic data set that performs close to the best model combination and the real data.
(iii) In contrast, when using the standard WGAN, the interval of F1-score values is significantly wider and only a very small percentage of them are close to the best value obtained with real data, which precludes the use of this method, as it is not likely to produce a realistic synthetic data set when generators are selected at random. However, this effect slightly decreases when F1-score elitism is used for selecting the generators.
Figure 13: Rendered data set (first use case). Baseline. F1-score on DS2-r (left) and DS1-r (right) using as training data set: (13a) a real data set (DS1-r) and (13b) a naive noise generator with means. Results for decision thresholds of 0.2, 0.4, 0.5, 0.6 and 0.8 are represented. Each experiment was run 100 times.

Conclusions
We proposed a novel activation function to be used at the output of the generator of a GAN. This activation function is based on the Smirnov probabilistic transformation (ST) and it is specifically designed to improve the quality of the generated data. The ST-based activation function provides a general approach that deals not only with the replication of categorical variables but with any type of data distribution (continuous or discrete). Moreover, this activation function is differentiable and therefore, it can be seamlessly integrated in the backpropagation computations during the GAN training processes.
The experimental results show that the GAN tuned with the ST-based activation function clearly outperforms a standard GAN. The quality of the data is so high that the generated data can fully substitute real data for training a nested classifier without a fall in the obtained accuracy. This result encourages the use of GANs to produce high-quality synthetic data that are applicable in scenarios in which data privacy must be guaranteed.
Table 4: Performance of synthetic traffic combining labels 0 and 1. Training with (i) a real data set (DS1), (ii) a mean-based noise generator and (iii and iv) GAN synthetic data (with linear and ST activation). Results on testing with real data (DS2). Experiment is drawn 100 times uniformly at random (Figure 15 and Figure 16). For the GANs, in each sample we choose one generator from among all label 0 generators and one generator from among all label 1 generators.