Mixed-type data generation method based on generative adversarial networks

Data-driven deep learning has become a key research direction in artificial intelligence, and abundant training data is a prerequisite for building efficient and accurate models. However, privacy protection policies often prevent research institutions from obtaining large amounts of training data, which leads to a shortage of training samples. In this paper, we propose a mixed-type data generation model based on generative adversarial networks that synthesizes fake data with the same distribution as the real data, so as to supplement the real data and increase the number of available samples. The model first pre-trains an autoencoder that maps the given dataset into a low-dimensional continuous space. A generator constructed in this low-dimensional space is then trained adversarially against a discriminator constructed in the original space. Because the discriminator accounts for the loss of both the continuous attributes and the labeled attributes, the generation network formed by the generator and the decoder can effectively learn the intrinsic distribution of the mixed data. We evaluate the proposed method in terms of both the independent distribution of each attribute and the relationships between attributes. The experimental results show that the proposed method preserves the intrinsic distribution better than other generation algorithms based on deep learning.

In this context, big data analysis and research often encounter problems such as a lack of data and too few training samples. Current research addresses this problem from two directions: information hiding and data generation. From the perspective of information hiding, for example, a health care organization (HCO) can reduce the risk of information leakage by perturbing potentially identifiable attributes through generalization, suppression and randomization before sharing the data [2,3]. However, attackers can still recover the personal tags corresponding to the data from the remaining attribute information, and thereby restore the original data.
With the development of deep learning and the many learning models that have been proposed, methods based on data generation have attracted more and more attention in the field of data privacy protection. Their main idea is to capture the latent distribution structure of a dataset by learning from very limited real data, and then to generate synthetic data with a distribution similar to that of the real data, so as to solve the problem of data deficiency [4]. In this work, we focus on generating high-dimensional mixed-type (continuous and discrete) data, which is a more important and more challenging problem than generating single-type data, whether continuous or discrete. We propose a new data generation architecture that combines the versatility of an autoencoder with the recent success of generative adversarial networks (GANs) on complex data types. To assess the quality of the synthetic data, we define several new metrics that evaluate how well the synthetic mixed-type data match the original data.

Related works
Deep generative models have proved to be highly flexible and expressive unsupervised learning methods that can capture the latent structure of complex high-dimensional data. A well-trained deep generative model can effectively simulate the complex distribution of high-dimensional data and generate synthetic data similar to the original data [5,6]. Early work on data generation was largely based on the Variational Autoencoder (VAE) [7], such as the Variational Lossy Autoencoder [8], DVAE++ [9] and ShapeVAE [10]. These methods have been shown to capture the latent structure of vast amounts of complex high-dimensional data efficiently and accurately. However, they cannot handle data with discrete features, let alone generate mixed continuous and discrete data. Recently, Nazábal et al. [11] proposed a general framework named HI-VAE, which is suitable for heterogeneous data generation and shows competitive predictive performance in supervised tasks.
GAN models such as MMD-GAN [12], AdaGAN [13] and WGANs [14] have achieved great success in synthetic image generation. A GAN adopts the idea of an adversarial game and consists of two parts, a generator G(·) and a discriminator D(·): the generator learns the distribution of the real samples and generates fake data that imitate the real data, while the discriminator aims to distinguish between the real data and the fake data [15,16].
With the practical application and theoretical development of GANs, more and more data scientists have turned their attention to this model [17]. At present, most GAN-related research focuses on continuous datasets, but big data applications usually involve discrete variables with multi-label features. Training networks with discrete outputs is a major challenge that limits the application of GANs in big data analysis. The main difficulty is that the network output is transformed by a softmax function into a multinomial distribution, and sampling from this distribution is not a differentiable operation. This blocks the gradient from back-propagating during GAN training on data with discrete features. To tackle this problem, the Gumbel-softmax technique has been incorporated into VAE-based and GAN-based methods for discrete sequence generation [18][19][20]. Addressing the same problem, seqGAN [21] proposes a stochastic policy based on reinforcement learning to avoid back-propagation through discrete sequences.
Another method that avoids back-propagation through discrete data is the adversarially regularized autoencoder (ARAE) [22]. The authors map the discrete words learned from text into a continuous latent feature space and use GANs to generate the latent feature distribution, which effectively improves training stability and yields a loss that is more correlated with sample quality. medGAN, proposed by Choi et al. [23], is inspired by this concept; it learns from real healthcare patient records and generates synthetic data. The model combines an autoencoder with GANs: an autoencoder is pre-trained first, the generator's output in the latent code space is then mapped back to the original space, and the discriminator receives either fake data from the generator or samples from the real data, forming an adversarial learning process.
To improve medGAN for the generation of multi-label variables, Camino et al. [24] proposed multi-categorical GANs based on the concept of medGAN. The idea is to encode the multi-label variables into a binary representation using one-hot encoding [25] and to apply Gumbel-softmax [18] to solve the back-propagation problem for multi-label data, which improves computational stability and convergence speed.
To the best of our knowledge, most GAN-based data generation work focuses on data with a single feature type, either numerical or discrete. In contrast to this research, we propose a mixed-type data generation model based on GANs. It improves the performance of mixed-type data generation by exploiting the ability of the autoencoder to learn the intrinsic characteristics of mixed-type features and by building the generator in the code space. The proposed framework uses the Gumbel-softmax technique to deal with the non-differentiability of discrete random variables, and optimizes the loss function to balance the gradient flow coming from the different feature types. We also provide a detailed empirical evaluation of the generation model on the Lending Club dataset. The results demonstrate that the proposed method performs better than the state-of-the-art VAE-based method [11], not only in approximating the distribution of each single feature but also in approximating the correlation between features.

Description of mixed-type data
In this paper, we assume that the features of the data are of two types: numerical and multi-label. The data space is defined as $S = W \times V$, where the numerical space is $W = W_1 \times \cdots \times W_M$ with $W \subseteq \mathbb{R}^M$. In the numerical space, we define the random vector $x = (x_1, \ldots, x_M) \in W$. The multi-label space is $V = V_1 \times \cdots \times V_N$, where each $V_i$ represents one multi-label feature (such as gender or occupation), and the number of categories of the $i$-th label is $d_i = |V_i|$. We also define the random variable in $V$ as $v = (v_1, v_2, \ldots, v_N) \in V$; each label variable $v_i$ is one-hot encoded and denoted by a vector $y_i = (y_{i,1}, \ldots, y_{i,d_i}) \in \{0, 1\}^{d_i}$. A random variable in the space $S$ can therefore be fully expressed as $s = (x, y) = (x_1, \ldots, x_M, y_1, \ldots, y_N)$.
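The representation above can be illustrated with a minimal sketch that concatenates the numerical values with the one-hot encoded labels. The function names `one_hot` and `encode_sample` and the example record are hypothetical and are not part of the proposed method.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[int]:
    """Return a one-hot vector y_i in {0, 1}^size with a 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("category index out of range")
    return [1 if k == index else 0 for k in range(size)]


def encode_sample(x: Sequence[float], v: Sequence[int], d: Sequence[int]) -> List[float]:
    """Build s = (x_1..x_M, y_1..y_N) from numerical values x and label indices v.

    d[i] is the number of categories d_i of the i-th label feature.
    """
    if len(v) != len(d):
        raise ValueError("need one category count per label feature")
    s = [float(value) for value in x]
    for index, size in zip(v, d):
        s.extend(float(bit) for bit in one_hot(index, size))
    return s


# Hypothetical record: M = 2 numerical features, N = 2 label features with d = (2, 3).
sample = encode_sample(x=[0.25, 0.8], v=[1, 2], d=[2, 3])
print(sample)  # [0.25, 0.8, 0.0, 1.0, 0.0, 0.0, 1.0]
```

The encoded vector has length $M + \sum_i d_i$, which is 2 + 5 = 7 in this example.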

The proposed mixGAN
The mixGAN proposed in this paper first pre-trains an autoencoder, which maps the mixed data space to a low-dimensional continuous space. Because the intrinsic features of the data can be represented more efficiently in this low-dimensional continuous code space, the generator G(·) of the mixGAN is built in the code space. The discriminator D(·) is built in the original mixed-type data space to distinguish real data from fake data. The mixGAN is obtained by joint adversarial learning between the generator G(·) and the discriminator D(·), and is trained across both the original space and the code space. We describe the mixGAN model in order, from the pre-autoencoder to the GAN.

Pre-autoencoder
The autoencoder consists of an encoder and a decoder. The encoder compresses the original high-dimensional data into the low-dimensional code space, and the decoder maps the code space back to the original data space. The autoencoder is trained so that, after the original data $x$ pass through the whole network, the output $\hat{x}$ is a good approximation of the input. Our pre-autoencoder modifies the traditional autoencoder by replacing the last output layer with a mixed-type output layer, formed by N + 1 parallel feature-extraction Dense layers as shown in Fig. 1. At the end of this parallel structure, output activation functions transform the components back to their original features.

Fig. 1 Autoencoder
The parallel structure of the output layer not only guarantees the independence of each single feature but also maintains the interdependence between features.
The encoder network is simply composed of two fully connected network (FCN) layers. The decoder network first applies two FCN layers that map the code space to a continuous vector, followed by N + 1 parallel data-type separation networks Dense$_0$, ..., Dense$_N$, as shown in Fig. 1.
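The parallel output structure can be sketched in plain Python as one head for the numerical block and one head per label feature, all reading the same hidden vector. This is only an illustration under assumptions that the text does not state: the sigmoid on the numerical head, the plain softmax on the label heads (the model itself uses Gumbel-softmax sampling), the random weights and all dimensions are hypothetical.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def dense(h: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map W h + b for one output head."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]


def softmax(a: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(a)
    e = [math.exp(v - m) for v in a]
    total = sum(e)
    return [v / total for v in e]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def random_layer(rng: random.Random, n_out: int, n_in: int) -> Layer:
    """Hypothetical untrained layer, used only to exercise the shapes."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mixed_output(h: List[float], num_head: Layer, label_heads: List[Layer]):
    """Dense_0 yields the numerical block; Dense_1..Dense_N yield one block per label."""
    x_hat = [sigmoid(v) for v in dense(h, *num_head)]
    y_hat = [softmax(dense(h, *head)) for head in label_heads]
    return x_hat, y_hat


rng = random.Random(0)
hidden_dim, M, d = 4, 3, [2, 5]  # hypothetical sizes
num_head = random_layer(rng, M, hidden_dim)
label_heads = [random_layer(rng, size, hidden_dim) for size in d]
h = [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]
x_hat, y_hat = mixed_output(h, num_head, label_heads)
print(len(x_hat), [len(y) for y in y_hat])  # 3 [2, 5]
```

Each label head produces its own probability vector, so every label feature is normalized independently while all heads share the same hidden representation.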
In this model, the Gumbel-softmax sampling technique is used to sample from the discrete distribution. It is widely used for discrete data generation because it solves the problem of back-propagating through discrete random variables [18]. Gumbel-softmax sampling models the hidden variable as a discrete multinomial distribution, and the transformation is given by formula (1), where $j = 1, \ldots, N$, $k = 1, \ldots, d_j$, $a_j$ is the output of the fully connected layer Dense$_j$, and $a_{j,k}$ is its $k$-th component. The $g_i$ are i.i.d. samples drawn from Gumbel(0, 1), computed as $g_i = -\log(-\log(u_i))$ with $u_i \sim U(0, 1)$. The hyperparameter $\tau \in (0, \infty)$ controls the degree of softening: the higher $\tau$ is, the smoother the distribution; the lower $\tau$ is, the closer the generated distribution is to a discrete one-hot distribution. During training, the real discrete distribution can be approached by gradually decreasing $\tau$.
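As an illustration of the sampling step, the sketch below assumes the standard Gumbel-softmax form $y_{j,k} = \exp((\log a_{j,k} + g_k)/\tau) / \sum_{l=1}^{d_j} \exp((\log a_{j,l} + g_l)/\tau)$, with the $a_{j,k}$ treated as positive scores. Whether the formula in this paper takes the logarithm of $a_{j,k}$ or uses it directly is an assumption here, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_gumbel(rng: random.Random) -> float:
    """Draw g = -log(-log(u)) with u ~ U(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, which would give log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(a: Sequence[float], tau: float, rng: random.Random) -> List[float]:
    """Relaxed one-hot sample for one label feature with positive scores a."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    z = [(math.log(a_k) + sample_gumbel(rng)) / tau for a_k in a]
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


rng = random.Random(0)
a = [0.1, 0.6, 0.3]  # hypothetical category scores
y = gumbel_softmax(a, tau=0.6, rng=rng)
print(y, sum(y))

# The most likely component of each sample follows the category
# probabilities, so the frequencies below should be close to a.
counts = [0, 0, 0]
for _ in range(5000):
    sample = gumbel_softmax(a, tau=0.6, rng=rng)
    counts[sample.index(max(sample))] += 1
frequencies = [c / 5000 for c in counts]
print(frequencies)
```

Because the output is a smooth function of the scores, gradients can pass through it, which is the property the text relies on.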
The pre-autoencoder loss function is given in (2) and is composed of two parts: the mean squared error is used for the loss of the numerical features, and the cross-entropy error is used for the loss of the multi-label features. Before feeding the training data to the model, we normalize the numerical features to the range [0, 1]. This balances the two parts of the loss in (2) and prevents the numerical loss from dominating the total loss, which would lead to a poor approximation of the multi-label features.
In (2), $x_m$ represents the $m$-th component of $x$, $y_{j,k}$ represents the $k$-th component of the multi-label feature $y_j$, and $B$ is the training batch size.
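A minimal sketch of such a two-part reconstruction loss is shown below. It assumes that the squared error is summed over the numerical components, that the cross entropy is summed over all label features, and that both parts are weighted equally and averaged over the batch; the exact weighting used in (2) may differ, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def reconstruction_loss(
    batch_x: Sequence[Sequence[float]],
    batch_x_hat: Sequence[Sequence[float]],
    batch_y: Sequence[Sequence[Sequence[float]]],
    batch_y_hat: Sequence[Sequence[Sequence[float]]],
    eps: float = 1e-12,
) -> float:
    """Squared error on numerical features plus cross entropy on one-hot labels."""
    B = len(batch_x)
    total = 0.0
    for x, x_hat, y, y_hat in zip(batch_x, batch_x_hat, batch_y, batch_y_hat):
        squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        cross_entropy = -sum(
            target * math.log(prob + eps)  # eps guards against log(0)
            for label, pred in zip(y, y_hat)
            for target, prob in zip(label, pred)
        )
        total += squared_error + cross_entropy
    return total / B


# Hypothetical batch with one sample: two numerical features, one binary label.
loss = reconstruction_loss(
    batch_x=[[0.2, 0.9]],
    batch_x_hat=[[0.2, 0.8]],
    batch_y=[[[0.0, 1.0]]],
    batch_y_hat=[[[0.5, 0.5]]],
)
print(loss)  # about 0.01 + log(2)
```

With unnormalized numerical features the squared-error term can be orders of magnitude larger than the cross-entropy term, which is why the normalization to [0, 1] matters.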

Generative adversarial network
The generative adversarial network consists of two network modules: the generator network and the discriminator network [15]. The generator $G(z; \theta_g)$ learns the distribution of the training data and converts a sample from the random prior distribution into a generated sample $G(z)$ whose distribution is similar to that of the training data. The discriminator $D(x; \theta_d)$ is a binary classifier that determines whether its input is a real sample or a generated fake sample; that is, it outputs a larger probability for real data and a smaller probability for fake data. During training, G(·) and D(·) play against each other until the data generated by G(·) can "cheat" D(·). The optimization goal of this game can be expressed as (3), where $P_{data}$ represents the distribution of the real samples and $P_z$ represents the random prior distribution, here $N(0, 1)$. In the alternating training of G(·) and D(·), the parameters are updated by the iterative formulas (4) and (5), where $B$ is the size of each training batch and $\alpha$ is the step size of the optimizer.
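For reference, the game and the alternating updates described above take the following form in the standard GAN formulation. This is a sketch under the assumption that the objective and update rules of this section follow that standard form; the sample index $i$ is introduced here for illustration.

```latex
\min_{G}\max_{D} V(D, G)
  = \mathbb{E}_{x \sim P_{data}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim P_{z}}\left[\log\left(1 - D(G(z))\right)\right]

\theta_d \leftarrow \theta_d + \alpha \nabla_{\theta_d} \frac{1}{B} \sum_{i=1}^{B}
  \left[\log D(x_i) + \log\left(1 - D(G(z_i))\right)\right]

\theta_g \leftarrow \theta_g - \alpha \nabla_{\theta_g} \frac{1}{B} \sum_{i=1}^{B}
  \log\left(1 - D(G(z_i))\right)
```

The discriminator takes a gradient ascent step on the batch estimate of $V(D, G)$, while the generator takes a descent step on the only term that depends on $\theta_g$.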

The architecture of mixGAN
The proposed mixGAN is constructed across the code space and the original space. The method is inspired by the recent success in discrete data generation using GANs [24], which addressed the difficulty of back-propagating through discrete random variables by using the Gumbel-softmax sampling technique. We use the encoder from the pre-trained autoencoder to map the original data to a low-dimensional continuous code space, in which we build the GAN-based generator. Based on this concept, the generator network G(z) maps a standard Gaussian variable $z \sim N(0, 1)$ to the code space, and the decoder network Dec(·) then maps the generated continuous variable back to the original space as $\hat{s}$. This process is shown in Fig. 2 and is expressed as Dec(G(z)) in the generation loss (7). The discriminator D(·) is built in the original space and judges whether an input item is real or fake using the discrimination loss (6).
The proposed mixGAN is an architecture that couples the pre-autoencoder model with the GAN structure. It combines the ability of the pre-autoencoder to capture mixed-type data information with the high performance of GANs in continuous data generation. At the same time, this architecture overcomes the limited ability of GANs to learn discrete data.
As shown in Fig. 2, the data generated by the generator G(·) are decoded before being passed to the discriminator, so the discriminator D(·) judges the authenticity of the data in the original space. In the training process, the loss functions for the discriminator D(·) and the generator G(·) are given in (6) and (7), where the discriminator loss is built from the per-sample term $\log D(x_i) + \log(1 - D(\mathrm{Dec}(G(z_i))))$. During the main training phase, the gradients flow from the discriminator to the decoder and then to the generator, and the decoder is fine-tuned while the generator is optimized.
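The two losses can be sketched as functions of the discriminator outputs. The sketch assumes the standard non-relativistic GAN losses with a batch average, where `d_real` holds the values $D(x_i)$ and `d_fake` holds the values $D(\mathrm{Dec}(G(z_i)))$; the sign conventions and function names are assumptions rather than the exact form of (6) and (7).

```python
import math
from typing import Sequence


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Negative batch mean of log D(x_i) + log(1 - D(Dec(G(z_i))))."""
    B = len(d_real)
    return -sum(
        math.log(r + eps) + math.log(1.0 - f + eps)
        for r, f in zip(d_real, d_fake)
    ) / B


def generator_loss(d_fake: Sequence[float], eps: float = 1e-12) -> float:
    """Batch mean of log(1 - D(Dec(G(z_i)))), minimized by the generator."""
    return sum(math.log(1.0 - f + eps) for f in d_fake) / len(d_fake)


# Hypothetical discriminator outputs for a batch of two samples.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, undecided)

# The generator loss falls as the discriminator is fooled more often.
print(generator_loss([0.1]), generator_loss([0.9]))
```

Since `d_fake` is computed on decoded samples, minimizing the generator loss sends gradients through both the decoder and the generator, which matches the fine-tuning behaviour described above.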

Experiment
To assess the performance of our model, we use the HI-VAE method [11] as a benchmark for comparative evaluation. HI-VAE distinguishes between the different feature types in the data when encoding and decoding, and designs a corresponding probability model for each type. According to the probability model of each feature, the HI-VAE encoder processes the features individually and aggregates the results to generate the code. The HI-VAE decoder performs the inverse process: the code is converted into the individual feature values, which are then concatenated.

Data acquisition
Our training dataset is a subset of a high-dimensional bank customer dataset hosted by Lending Club [26]. We randomly sample 10,000 records from the original dataset and split them 9:1 into a training set and a test set. The original dataset has 31 features, of which we removed the 7 that have constant values. We rearrange the remaining 24 features so that they match our data model: the first 15 features are of numerical type and the remaining 9 features are of multi-label type with one-hot encoding. Hence, we have $s_i = [x_1; \ldots; x_{15}; y_1; \ldots; y_9]$, and the numbers of categories of the label features are (2, 2, 2, 12, 2, 7, 29, 4, 3).
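The layout of the encoded vector follows directly from these numbers. The short sketch below computes the total input dimension and the position of each one-hot block; the helper name `label_slices` is hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

NUM_NUMERICAL = 15
CATEGORY_SIZES = (2, 2, 2, 12, 2, 7, 29, 4, 3)


def label_slices(num_numerical: int, sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Return the (start, end) positions of each one-hot block in s_i."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return [(num_numerical + s, num_numerical + e) for s, e in zip(starts, ends)]


slices = label_slices(NUM_NUMERICAL, CATEGORY_SIZES)
input_dim = NUM_NUMERICAL + sum(CATEGORY_SIZES)
print(input_dim)              # 78
print(slices[0], slices[-1])  # (15, 17) (75, 78)
```

Each encoded record therefore has 15 + 63 = 78 components, with the numerical block in positions 0 to 14.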

Fig. 2 mixGAN
A common problem in big data processing is that the numerical values often differ greatly in magnitude from the one-hot encoded label values. If the model is trained on the raw data, the gradient flow coming from the numerical features dominates the back-propagation, which weakens the learning ability and reliability of the model. In our experiment, we use Min-Max normalization to rescale the numerical features to the range [0, 1], so that their magnitude is similar to that of the one-hot encoded multi-label features. Empirically, the normalization not only improves the accuracy of the model but also accelerates the convergence of training.
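Min-Max normalization of a single numerical feature can be sketched as follows. The function name and the example column are hypothetical, and the guard for constant columns is an added safety check (constant features were already removed from the dataset).

```python
from typing import List, Sequence


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Rescale one numerical feature to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # a constant column carries no information; map it to 0
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical loan amounts with a much larger magnitude than one-hot values.
loan_amounts = [1000.0, 5000.0, 35000.0]
normalized = min_max_normalize(loan_amounts)
print(normalized)
```

After rescaling, the smallest value maps to 0 and the largest to 1, so each numerical component contributes to the loss on the same scale as a one-hot component.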

Implementation details
The proposed pre-autoencoder contains two hidden FCN layers in both the encoder and the decoder, and all layers use the tanh activation function. We empirically set the dimension of the latent continuous code space to 72, and the hyperparameter $\tau$ in the Gumbel-softmax activation function is set to 0.6.
For GAN training, the generator G(·) and the discriminator D(·) are both implemented as FCNs with 3 layers each, with layer sizes [256, 128, 72] and [128, 64, 1], respectively. Batch normalization is also applied between the layers. Following [24], the hidden layers of G(·) use the tanh activation function, while the hidden layers of D(·) use the LeakyReLU activation function. We use the Adam algorithm to optimize the model, with a learning rate of 0.002 and a weight decay of 0.001. The batch size is set to B = 100. The training time of the pre-autoencoder is 52.30 s, and the training time of the mixGAN model is 880.64 s.
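To give a feeling for the size of these networks, the sketch below counts the weights and biases of fully connected stacks with the stated layer sizes. It assumes that the listed numbers are layer output widths and that the discriminator input has 78 components (15 numerical plus 63 one-hot); the noise dimension `Z_DIM` is not given in the text and is purely hypothetical, and batch normalization parameters are not counted.

```python
from typing import Sequence


def dense_param_count(input_dim: int, widths: Sequence[int]) -> int:
    """Number of weights and biases in a stack of fully connected layers."""
    count, fan_in = 0, input_dim
    for width in widths:
        count += fan_in * width + width  # weight matrix plus bias vector
        fan_in = width
    return count


GENERATOR_WIDTHS = [256, 128, 72]     # output lies in the 72-dimensional code space
DISCRIMINATOR_WIDTHS = [128, 64, 1]   # single real/fake output
INPUT_DIM = 78                        # 15 numerical + 63 one-hot components
Z_DIM = 72                            # hypothetical noise dimension

discriminator_params = dense_param_count(INPUT_DIM, DISCRIMINATOR_WIDTHS)
generator_params = dense_param_count(Z_DIM, GENERATOR_WIDTHS)
print(discriminator_params)  # 18433
print(generator_params)
```

Both networks are small by deep learning standards under these assumptions.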

Results
Evaluating the performance of GANs is widely known to be a difficult task [27]. Borji [27] surveys a range of metrics commonly used for assessing the performance of GANs, but they are not suitable for evaluating big data generation. In this paper, we assume that generated data that approximate the original data well should satisfy two conditions: first, for each single feature, the distribution of the generated values should be as close as possible to the real data distribution; second, the dependency among features should be similar to that of the real data. Based on these assumptions, we evaluate the performance of the mixGAN in terms of the distribution approximation for single features and the preservation of the correlation between features.

Distribution approximation for single feature
To evaluate how well the independent distribution of each feature is approximated, we treat the numerical features and the label features separately. For a numerical feature $x_i$, we divide the interval [0, 1] into 10 bins and compute the histograms of the generated and the original feature. We then pair the corresponding histogram bins of the real distribution and the generated fake distribution as $(P_{real}, P_{fake})$. As with a joint histogram, if the two random variables have similar distributions, the paired points $(P_{real}, P_{fake})$ should lie along the diagonal of the coordinate plane.
A similar concept is applied to the multi-label features. Since each label feature $y_i$ is one-hot encoded, each component $y_{i,j}$ is either 1 or 0. Hence, we accumulate each component $y_{i,j}$ over all data and denote the result by $P_{real}$ for the original label feature and $P_{fake}$ for the generated label feature. If the synthesized label features $y_i$ approximate the original data well, the paired points $(P_{real}, P_{fake})$ should again lie along the diagonal of the coordinate plane.
Following this concept, we plot the paired points $(P_{real}, P_{fake})$ in Fig. 3, where (a) is obtained with the proposed mixGAN and (b) with the HI-VAE proposed in [11]. The circular points represent the label type, and the star points represent the numerical type. Fig. 3 shows that mixGAN clearly outperforms HI-VAE in independent feature approximation, since the paired points from mixGAN, for both the numerical and the label types, lie closer to the diagonal than those from HI-VAE.
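The pairing of real and generated distributions can be sketched as follows. The sketch assumes that the histograms and label counts are normalized to relative frequencies and that out-of-range generated values are clamped into the first or last bin; the function names and the tiny example data are hypothetical.

```python
from typing import List, Sequence, Tuple


def histogram_pairs(real: Sequence[float], fake: Sequence[float],
                    bins: int = 10) -> List[Tuple[float, float]]:
    """Pair the bin frequencies of a numerical feature normalized to [0, 1]."""
    def hist(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = min(max(int(value * bins), 0), bins - 1)  # 1.0 goes to the last bin
            counts[index] += 1
        return [count / len(values) for count in counts]
    return list(zip(hist(real), hist(fake)))


def label_frequency_pairs(real_rows: Sequence[Sequence[int]],
                          fake_rows: Sequence[Sequence[int]]) -> List[Tuple[float, float]]:
    """Pair the per-category frequencies of a one-hot encoded label feature."""
    def freq(rows: Sequence[Sequence[int]]) -> List[float]:
        return [sum(column) / len(rows) for column in zip(*rows)]
    return list(zip(freq(real_rows), freq(fake_rows)))


numeric_pairs = histogram_pairs(real=[0.05, 0.15, 0.95, 1.0],
                                fake=[0.05, 0.05, 0.95, 0.55])
label_pairs = label_frequency_pairs(
    real_rows=[[1, 0], [0, 1], [0, 1], [0, 1]],
    fake_rows=[[1, 0], [1, 0], [0, 1], [0, 1]],
)
print(numeric_pairs[0], numeric_pairs[9])  # (0.25, 0.5) (0.5, 0.25)
print(label_pairs)                         # [(0.25, 0.5), (0.75, 0.5)]
```

A generator that matched the real data perfectly would give pairs with equal coordinates, that is, points exactly on the diagonal.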

Correlation maintenance between features
The basic idea for assessing the correlation between features of the generated data is that, in the generated dataset, the influence of the remaining features on a feature $f_i$ should be as similar as possible to that in the original data. Accordingly, we build a learning model that estimates the feature $f_i$ from the remaining features. The model is formulated as a multi-class classification task when $f_i$ is of multi-label type, and as a regression task when $f_i$ is of numerical type. We denote the estimation loss for $f_i$ obtained with the real data by $E^i_{real}$ and that obtained with the generated data by $E^i_{fake}$. In testing, all estimation models are FCNs, but the loss function depends on the type of $f_i$. For a numerical feature the loss is $\frac{1}{N}\sum_{j=1}^{N}(x^i_j - \hat{x}^i_j)^2$, and for a multi-label feature it is $\frac{1}{N}\sum_{j=1}^{N} I(y^i_j \neq \hat{y}^i_j)$, where $I(\cdot)$ is the indicator function and $N$ is the total number of samples in the test set.
In Fig. 4, we plot all the paired points $(E_{real}, E_{fake})$ in the plane, where (a) and (b) show the estimation errors obtained with the mixGAN and with HI-VAE [11], respectively. The figure shows that the proposed mixGAN is superior to HI-VAE in preserving the correlation between features, especially for the numerical features.
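The two test-set errors can be sketched as follows, assuming that the label error counts the samples whose predicted category differs from the true one. The function names and the example values are hypothetical.

```python
from typing import Sequence


def regression_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean squared error of a numerical feature over the test set."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def classification_error(targets: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of test samples whose predicted category is wrong."""
    return sum(1 for t, p in zip(targets, predictions) if t != p) / len(targets)


numeric_error = regression_error([0.0, 1.0], [0.0, 0.5])
label_error = classification_error([0, 1, 2, 1], [0, 2, 2, 1])
print(numeric_error)  # 0.125
print(label_error)    # 0.25
```

Computing each error once for a predictor built from real data and once for a predictor built from generated data gives one paired point per feature.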