1 Introduction

Nowadays, the Generative Adversarial Network (\({{\texttt {GAN}}}\)) model and its variants are widely utilized across domains such as Computer Vision [1], Data Privacy [2] and Medicine [3], owing to their excellent performance compared to other methods. A typical \({{\texttt {GAN}}}\) model [4] consists of two components: the Generator learns to produce synthetic output from input noise, whereas the Discriminator learns to distinguish the generator's synthetic data from the real data. The two components interact through a game-theoretic algorithm such that, as training continues, the generator learns to produce increasingly realistic synthetic data and the discriminator learns to distinguish synthetic from real data more accurately. Typically, the generator and the discriminator are modelled as deep Artificial Neural Networks (\({ { \texttt {ANN} } }\)) with convolutional and dense layers. Though their application to synthetic data generation for computer vision tasks is ground-breaking [5,6,7], their application to tabular data generation has been beset with challenges. The main challenge stems from the fact that no explicit structure is available among the input features that convolutional layers can exploit [8], leaving the dense layers responsible for engineering features that capture the correlation among features. Note that tabular datasets mainly consist of categorical and numeric features, and handling these different feature types seamlessly is not trivial. Despite these challenges, the application of \({{\texttt {GAN}}}\)s to tabular data generation has shown promising results, with models like \({ { \texttt {CTGAN} } }\) [9], \({ { \texttt {TableGAN} } }\) [8] and \({ { \texttt {MedGAN} } }\) [3] leading to state-of-the-art (SOTA) results. For example, \({ { \texttt {CTGAN} } }\) generates tabular data via a conditional \({{\texttt {GAN}}}\)-like strategy, in which the categorical features are treated as the condition while a Gaussian Mixture Model is used to estimate the numeric features; it utilizes the Wasserstein Distance with gradient penalty to generate the synthetic data. \({ { \texttt {TableGAN} } }\), on the other hand, uses convolutional layers in both the generator and the discriminator, and introduces an information loss-based objective function.

We believe that tabular data generation (especially with \({{\texttt {GAN}}}\) models) is still in its early days, and there is a strong demand for more accurate and more interpretable data generation models. In this work, we demonstrate that the \({{\texttt {GAN}}}\) strategy is indeed effective, but that a fundamentally different approach to its generator and discriminator components (instead of relying on \({ { \texttt {ANN} } }\) and its variants) can lead to much better results. Before we discuss our formulation, let us discuss some limitations of existing \({{\texttt {GAN}}}\)-based tabular data generation models:

  • First, effective modelling of feature interactions is critical in many machine learning tasks, and data generation is no exception [10, 11]. The generators and discriminators in vanilla \({{\texttt {GAN}}}\) models can capture feature interactions, but this implicit feature interaction is not interpretable [11]. Also, there is no guarantee that any particular interaction is captured by the model. For example, we may have prior knowledge that the feature Salary is highly correlated with the feature Age—but the generator in a vanilla \({{\texttt {GAN}}}\) may or may not capture this interaction, because the model may well find other, more useful interactions instead. This lack of explicit feature interaction modelling is one of the main factors impacting the performance of synthetic data generation. Of course, one solution is to craft the feature interactions manually—but this is tedious and time consuming.

  • Secondly, and related to the first issue, the use of \({ { \texttt {ANN} } }\)s and their variants in existing \({{\texttt {GAN}}}\)-based tabular data generation models leads to a generation process that is hard to interpret. Since the goal of tabular data generation is typically to improve the performance of downstream machine learning tasks (e.g. classification, regression, etc.), the demand for interpretability is all the more crucial.

It can be seen that these two limitations of existing vanilla \({{\texttt {GAN}}}\)s mainly stem from the fact that their generator component is carved out of a deep \({ { \texttt {ANN} } }\) (or its variants). Can we utilize a different model as the generator to address these limitations? Answering this question has been the main motivation of this work.

In this work, we propose a radically different formulation, which departs from existing vanilla \({{\texttt {GAN}}}\)-based data generation—instead of using a deep \({ { \texttt {ANN} } }\) (or its variants) as the generator (and the discriminator), we utilize the Bayesian Network (\({ { \texttt {BN} } }\)) model. A \({ { \texttt {BN} } }\) is a directed acyclic graphical model—its training involves learning the structure as well as the associated parameters. By specifying (or learning) the structure, one can explicitly incorporate feature interactions, and since its parameters correspond to actual probabilities, the model is interpretable. It can be seen that \({ { \texttt {BN} } }\) has desirable data generation properties which can address the above-mentioned limitations of existing tabular \({{\texttt {GAN}}}\) models. But how can one use a \({ { \texttt {BN} } }\) in the \({{\texttt {GAN}}}\) formulation? To answer this question, let us dive deeper into \({ { \texttt {BN} } }\)s.

Typical \({ { \texttt {BN} } }\) models work with tabular data and maximize the log-likelihood (LL) \(\sum _{i=1}^{m} \log \textrm{P}(y^{i}, \textbf{x}^{i})\), where m is the number of data points, \(\textbf{x}\) is a vector of independent features with discretized values and y is the dependent target feature. Note, the numeric features of \(\textbf{x}\) are discretized prior to calculating the probabilities in \({ { \texttt {BN} } }\) models. A \({ { \texttt {BN} } }\) is an example of a generative model, as one can use it to generate (sample) data once the model is learned. Of course, one can calculate \(\textrm{P}(y^{i} | \textbf{x}^{i})\) in a \({ { \texttt {BN} } }\) to obtain a classifier. However, the predictive performance of \({ { \texttt {BN} } }\) classifiers is generally not as good as that of models that directly optimize \(\textrm{P}(y^{i} | \textbf{x}^{i})\)—also known as discriminative models. Interestingly, generative models such as \({ { \texttt {BN} } }\) can be trained by optimizing a discriminative objective function such as the conditional log-likelihood (CLL) \(\sum _{i=1}^{m} \log \textrm{P}(y^{i} | \textbf{x}^{i})\) [12]. For example, a popular instance of a \({ { \texttt {BN} } }\) is the naive Bayes (\({ { \texttt {NB} } }\)) classifier, whose discriminative equivalent is the well-known Logistic Regression (\({ { \texttt {LR} } }\)) model, which of course optimizes the CLL. \({ { \texttt {NB} } }\) and \({ { \texttt {LR} } }\) are well-known examples of generative-discriminative equivalent models—in general, one can train any \({ { \texttt {BN} } }\) by optimizing the CLL, leading to a corresponding generative-discriminative equivalence. Following the notation in [13], we denote a \({ { \texttt {BN} } }\) trained by optimizing the CLL as \({ { \texttt {BN} } }^d\); under this notation, we have \({ { \texttt {LR} } }\equiv { { \texttt {NB} } }^d\) [14]. Since structure learning of a general \({ { \texttt {BN} } }\) is time consuming, we can resort to simple restricted models such as Tree-Augmented Naive Bayes (TAN) or K-Dependence Bayesian estimators (KDB), in which structure learning takes only one or two passes through the data. One can then train \({ { \texttt {TAN} } }\) or \({ { \texttt {KDB} } }\) by optimizing a discriminative objective function, i.e. the CLL—leading to the \({ { \texttt {TAN} } }^d\) and \({ { \texttt {KDB} } }^d\) formulations, respectively.
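To make this generative-discriminative correspondence concrete, the following minimal sketch (our illustration, not the paper's code) fits a naive Bayes model and its discriminative counterpart, a logistic regression over one-hot encoded categories, on a discretized version of a toy dataset; the dataset, bin count and solver settings are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small numeric dataset and discretize it, since BN models operate on
# categorical (discretized) features.
X, y = load_iris(return_X_y=True)
X_disc = KBinsDiscretizer(n_bins=5, encode="ordinal",
                          strategy="uniform").fit_transform(X).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_disc, y, random_state=0)

# Generative model: naive Bayes (NB), which maximizes the joint log-likelihood.
nb = CategoricalNB(min_categories=5).fit(X_tr, y_tr)

# Discriminative counterpart (NB^d, i.e. LR): logistic regression over one-hot
# encoded categories, which maximizes the conditional log-likelihood.
enc = OneHotEncoder(handle_unknown="ignore").fit(X_tr)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(X_tr), y_tr)

print("NB accuracy:", nb.score(X_te, y_te))
print("LR accuracy:", lr.score(enc.transform(X_te), y_te))
```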

Although one can sample from a standard \({ { \texttt {BN} } }\), the samples might not be of good quality—since when optimizing \(\textrm{P}(y,\textbf{x})\), there is no guarantee that \(\textrm{P}(y|\textbf{x})\) will be well-calibrated [15]. One solution is to sample directly from \({ { \texttt {BN} } }^d\), which optimizes \(\textrm{P}(y|\textbf{x})\). The issue here is that its parameters are not constrained to be actual probabilities.Footnote 1 One solution is to constrain the weights to be actual probabilities during the training of the discriminative objective function. Such weight constraining has been explored for \({ { \texttt {LR} } }\) and \({ { \texttt {KDB} } }\) in [13]. Following the naming conventions from existing work in this area, we denote a \({ { \texttt {BN} } }\) that is learned discriminatively, but whose weights are constrained to conform to actual probabilities, as \({ { \texttt {BN} } }^e\).

We claim that \({ { \texttt {BN} } }^e\) is the perfect model to be used as a generator in \({{\texttt {GAN}}}\) models.Footnote 2 In particular, it optimizes a discriminative objective function (and hence can be learned end-to-end with the back-propagation algorithm), its weights are actual probabilities from which one can sample, and, importantly, the model can be interpreted based on the learned weights. One drawback of using a \({ { \texttt {BN} } }\) in \({{\texttt {GAN}}}\) models is that it can only process, and hence generate, data with categorical attributes. Note, it has recently been shown that, contrary to popular belief, discretization can lead to models whose performance is superior to their numeric counterparts [16]. Therefore, discretizing numeric attributes and then sampling using \({ { \texttt {BN} } }^e\) can be an effective alternative to directly generating numeric attributes.
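The following sketch illustrates one way such a discretize-then-sample pipeline could look; it is an assumption for illustration only, and `sampled_bins` merely stands in for draws from a trained \({ { \texttt {BN} } }^e\) generator.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# A numeric column standing in for, e.g., a Salary feature.
salary = rng.normal(loc=50_000, scale=15_000, size=1_000).reshape(-1, 1)

# Discretize into quantile bins; the BN^e would then be trained and sampled
# entirely in this categorical space.
disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
disc.fit(salary)

# Hypothetical stand-in for bin indices sampled from a trained BN^e generator.
sampled_bins = rng.integers(0, 10, size=(200, 1)).astype(float)

# Map the sampled bin indices back to numeric values (bin midpoints).
synthetic_salary = disc.inverse_transform(sampled_bins)
print(synthetic_salary[:5].ravel())
```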

In our proposed \({{\texttt {GAN}}}\) formulation, we use typical Bayesian Network models as the generator (\({ { \texttt {BN} } }^e\), e.g. \({ { \texttt {NB} } }^e\), \({ { \texttt {TAN} } }^e\), \({ { \texttt {KDB} } }^e\)) and their discriminative counterparts as the discriminator (\({ { \texttt {BN} } }^d\), e.g. \({ { \texttt {NB} } }^d\), \({ { \texttt {TAN} } }^d\), \({ { \texttt {KDB} } }^d\), etc.). We name this new formulation Generative Adversarial Network modelling inspired by Naive Bayes and Logistic Regression's relationship (GANBLR). However, the \({ { \texttt {NB} } }\) and \({ { \texttt {LR} } }\) terms in the \({ { \texttt {GANBLR} } }\) acronym are only figurative, representing the broad generative and discriminative learning paradigms. In practice, one can use any generative model as the generator and its corresponding discriminative model as the discriminator.

Let us summarize the contributions of this work:

  • We propose a novel model for tabular data generation, namely \({ { \texttt {GANBLR} } }\), which uses \({ { \texttt {BN} } }^e\) as the generator and \({ { \texttt {BN} } }^d\) as the discriminator. The generator in \({ { \texttt {GANBLR} } }\) interacts with the discriminator in an adversarial structure to improve the quality of the synthetic tabular data. Note, \({ { \texttt {GANBLR} } }\) is limited to producing datasets with categorical attributes only.

  • Even though \({ { \texttt {BN} } }^e\) has been studied in the context of classification—this is the first work which studies its effectiveness as a data generation technique.

  • We compare \({ { \texttt {GANBLR} } }\) to existing SOTA \({{\texttt {GAN}}}\) models on 15 public tabular datasets. The results demonstrate that \({ { \texttt {GANBLR} } }\) not only outperforms these models in terms of machine learning utilityFootnote 3 and statistical similarity (measured with Jensen-Shannon Divergence and Wasserstein Distance), but also provides better interpretability.

The rest of the paper is organized as follows. We discuss related work in Sect. 2. The details of \({ { \texttt {GANBLR} } }\) are provided in Sect. 3. We provide an extensive experimental analysis in Sect. 4, and conclude in Sect. 5 with pointers to future work.

2 Related work

In this section, we will start by discussing the existing \({{\texttt {GAN}}}\)-based models for tabular synthetic data generation. Later, we will discuss discriminative training of Bayesian Network models.

2.1 Tabular data generation—\({{\texttt {GAN}}}\) models

The current research on the application of \({{\texttt {GAN}}}\) models for tabular synthetic data generation has taken two directions. The first direction utilizes the vanilla \({{\texttt {GAN}}}\) structure, whereas the second utilizes conditional models based on the conditional \({{\texttt {GAN}}}\) structure. In the following, we discuss these two directions in detail.

2.1.1 Vanilla \({{\texttt {GAN}}}\)-based tabular generation

This stream of research is based on the foundational work of [4], in which random noise (generated from a predetermined distribution) is used as the input to the generator. The generator uses this input to approximate the real data distribution (for instance with an encoder-decoder model), and its output is the generated synthetic data. The discriminator uses the generated synthetic data together with the real data to train a classifier that distinguishes the synthetic data from the real data. There are four notable works that utilize this vanilla \({{\texttt {GAN}}}\) strategy, which we discuss in the following. Note, we use the term vanilla \({{\texttt {GAN}}}\) to refer to the model of [4]; of course, almost all works utilizing \({{\texttt {GAN}}}\)s are variants of the framework proposed in [4].

The \({ { \texttt {MedGAN} } }\) [3] model is one of the earliest works to use an auto-encoder architecture as the structure of the generator. The model can generate both the categorical and numerical features needed to produce authentic medical electronic health records. Training in the \({ { \texttt {MedGAN} } }\) model utilizes mini-batch averaging to address the mode collapse problem. Additionally, batch normalization with shortcut connections is utilized.

The \({ { \texttt {CrGAN} } }\) model [17] is designed to generate Passenger Name Record (PNR) data for the aviation industry. PNR data contains a passenger's personal details such as name and date of birth, the reserved trip information, the flight information, the payment details, etc. Note, PNR data can consist of both categorical and numerical features, which can have missing values. Generating features with missing values is challenging with the normal vanilla \({{\texttt {GAN}}}\) structure. To address this issue, the \({ { \texttt {CrGAN} } }\) model proposes categorical feature embedding as well as a Cross-Net architecture. It is shown that the \({ { \texttt {CrGAN} } }\) model can generate high-quality PNR data.

A convolutional neural network is utilized as the generator of \({ { \texttt {TableGAN} } }\) [8], and an information loss-based objective is introduced. It is shown that \({ { \texttt {TableGAN} } }\) ensures high machine learning utility, and it is also claimed to preserve the privacy of the data. Note, \({ { \texttt {PATEGAN} } }\) [2] is another notable work, similar to \({ { \texttt {TableGAN} } }\), that is designed to prevent privacy attacks.Footnote 4

The above-mentioned vanilla \({{\texttt {GAN}}}\)-based tabular generation models are effective in their respective domains, but they still face two limitations. Firstly, these models have been designed for binary-class datasets in specialized domains, and therefore their suitability and generalization to other settings (e.g. multi-class datasets) is not clear. Secondly, the above models are not capable of generating synthetic data with a guaranteed value for one of the features. This second limitation makes them unsuitable for generating imbalanced machine learning datasets, e.g. in fraud detection or anomaly detection.

2.1.2 Conditional \({{\texttt {GAN}}}\)-based tabular generation

Conditional \({{\texttt {GAN}}}\)-based tabular data generation models make use of a conditional vector to specify the particular feature value or class label to be generated. A notable work in this stream of research is \({ { \texttt {CW-GAN} } }\) [18], which has been shown to achieve better results than competing methods on credit data generation. Three different loss functions are included in the \({ { \texttt {CW-GAN} } }\) model. The first is the Wasserstein Distance loss, calculated between the synthetic and the real data. The second is the gradient penalty, which regularizes the model complexity of the discriminator. The last is the auxiliary classifier loss, which encourages the generator to generate synthetic data belonging to the specified class.

The current SOTA model for tabular data generation, \({ { \texttt {CTGAN} } }\) [9], falls in this research stream as well. \({ { \texttt {CTGAN} } }\) leverages mode-specific normalization and a training-by-sampling process to generate better-quality synthetic datasets with both categorical and numeric features. Inspired by the Gaussian Mixture Model, mode-specific normalization first computes the modes of a numeric feature; the mean and standard deviation of each mode are then captured, and the numeric feature values are normalized with the associated mean and standard deviation. The normalized values are concatenated with the categorical features to form the input to the \({ { \texttt {CTGAN} } }\) model. The training-by-sampling strategy is the key component of \({ { \texttt {CTGAN} } }\)'s generator: it ensures that instances from the minority class have a similar chance of being sampled as instances from the majority class.

Although \({ { \texttt {CW-GAN} } }\) and \({ { \texttt {CTGAN} } }\) can generate synthetic data with a particular feature value, the drawbacks of existing \({{\texttt {GAN}}}\)-based methods for tabular data generation that we mentioned in Sect. 1 remain outstanding. Firstly, the models are not interpretable, i.e. the synthetic data generation process does not allow practitioners to determine why a generated instance belongs to a particular class. Secondly, the input feature interactions cannot be modelled directly during training. Lastly, the performance of these methods on a wide range of datasets, from small to large and binary to multi-class, is still to be systematically studied.

2.2 Discriminative Bayesian network

Let us discuss discriminative Bayesian Networks in this section. In standard Bayesian Networks, one can learn feature interactions as part of structure learning in either a restricted [19] or an unrestricted mode. In the unrestricted mode, we learn the structure of the network from the data and do not limit the number of parents each attribute can take. This process can be computationally intensive and hence time consuming. The alternative is the restricted mode—where we make use of count statistics such as mutual information and limit the number of parents each attribute can take, leading to models such as TAN and KDB [20]. The KDB model can learn the structure in just one or two passes through the data, and is therefore very computationally efficient (a simple sketch of this parent-selection idea is given below). The second phase of learning in a Bayesian Network is the learning of the parameters, which depends on the objective function being optimized. As discussed in Sect. 1, Bayesian Networks are traditionally optimized with the log-likelihood, a generative objective function that admits a closed-form solution. However, one can instead optimize the conditional log-likelihood, a discriminative objective function, leading to the discriminative class-conditional Bayesian Network (\({ { \texttt {BN} } }^d\)) and the extended class-conditional Bayesian Network (\({ { \texttt {BN} } }^e\)) formulations [21]—optimized via an iterative optimization algorithm. Since the goal of a Bayesian Network has been to estimate probabilities of the form \(\textrm{P}({\textbf{y}}|\textbf{x})\), there is some debate over whether the parameters of a discriminative Bayesian Network should be actual probabilities or not. This is the main difference between the two formulations \({ { \texttt {BN} } }^d\) and \({ { \texttt {BN} } }^e\)—the parameters are not constrained in the former, but are constrained to be actual probabilities in the latter. The main benefits of using the constrained discriminative Bayesian Network (\({ { \texttt {BN} } }^e\)) are interpretability and the capability of incorporating higher-order feature interactions. For example, if the parameters associated with high-order interactions are constrained to be actual probabilities, the model offers excellent interpretability.
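The following minimal sketch (our simplification, not the authors' implementation) illustrates the restricted, KDB-style parent selection described above: features are ranked by mutual information with the class, and each feature keeps at most k higher-ranked features as parents. For brevity it uses unconditional rather than conditional mutual information to score candidate parents; the function name kdb_parents and the toy data are assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def kdb_parents(X, y, k=1):
    """Return, for each feature, up to k parent features (the class is always
    an implicit parent). X: (m, n) integer-coded features, y: (m,) labels."""
    n = X.shape[1]
    # Rank features by mutual information with the class.
    mi_with_class = [mutual_info_score(X[:, i], y) for i in range(n)]
    order = np.argsort(mi_with_class)[::-1]
    parents = {}
    for pos, i in enumerate(order):
        # Candidate parents are higher-ranked features; keep the k that share
        # the most (here: unconditional) mutual information with feature i.
        candidates = order[:pos]
        scores = [mutual_info_score(X[:, i], X[:, j]) for j in candidates]
        top = [int(candidates[t]) for t in np.argsort(scores)[::-1][:k]]
        parents[int(i)] = top
    return parents

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 4))   # toy integer-coded categorical features
y = rng.integers(0, 2, size=500)        # toy class labels
print(kdb_parents(X, y, k=1))
```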

3 Method

Let us start by presenting some preliminaries to be used as a foundation, as well as the notation that is used throughout this paper. We then delve into our proposed algorithm—\({ { \texttt {GANBLR} } }\)—and discuss its learning algorithm as well as a variant.

3.1 Preliminary and notations

Table 1 List of symbols used in this work

We denote the generative model (generator) as \({\textbf{G}}\) and the discriminative model (discriminator) as \({\textbf{D}}\). The real (or original) dataset is denoted as \({\mathcal {D}}_{\text {data}} = [(X_{g}^{k},Y_{g})]\), where \(X_{g}^{k} = [\textbf{x}^1,\textbf{x}^2,\ldots ,\textbf{x}^m]\) with \(\textbf{x}^i \in {\mathcal {R}}^{n}\), i.e. data with a total of m instances, each having n independent features with k-order feature interactions present among them. Similarly, \(Y_{g} = [y^1,y^2,\ldots ,y^m]\), where \(y^i \in {\mathcal {R}}^{1}\), constitutes the corresponding class labels. The data (\({\mathcal {D}}_{\text {data}}\)) has a maximum level of interaction present among its features, which is denoted by k hereFootnote 5. Of course, for a generator to produce samples effectively, it must be able to model these k-order interactions present in the data. If a \({ { \texttt {BN} } }^e\) model is used as the generator, we can specify k directly; however, if a traditional deep \({ { \texttt {ANN} } }\) is used, modelling interactions of order k is more of a trial-and-error exercise in determining the right breadth and depth of the generator network. The subscript g in the notation makes it explicit that the dataset is processed by the generator model—\({\textbf{G}}\).

We denote the real data distribution as \(\textrm{P}_{\text {data}}(X_{g},Y_{g})\) or \(\textrm{P}_{\text {data}}(\cdot )\) from which a sample \({\mathcal {D}}_{\text {data}}\) is generated.

In the \({{\texttt {GAN}}}\) formulation, \({\textbf{G}}\) is trained to approximate the real data distribution \(\textrm{P}_{\text {data}}(\cdot )\) from some random (noise) input. We denote the random input data as \({\mathcal {Z}} = [\textbf{z}^1,\ldots ,\textbf{z}^m]\), and the distribution generating \({\mathcal {Z}}\) as \(\textrm{P}_{{\mathcal {Z}}}({\mathcal {Z}})\) or \(\textrm{P}_{{\mathcal {Z}}}(\cdot )\).

The synthetic dataset is denoted as \({\mathcal {S}}_{\text {data}} = [({\bar{X}}_{g}^{k},{\bar{Y}}_{g})]\), where \({\bar{X}}_{g}^{k} = [{\bar{\textbf{x}}}^{1},\ldots ,{\bar{\textbf{x}}}^{m}]\), and \({\bar{Y}}_{g} = [{\bar{y}}^{1},\ldots ,{\bar{y}}^{m}]\). Here, \({\bar{\textbf{x}}}^i \in {\mathcal {R}}^n\) and \({\bar{y}}^i \in {\mathcal {R}}^1\). Again, the superscript k denotes that the synthetic data should have the same order of feature interactions as in the original dataset.

As we know, in the \({{\texttt {GAN}}}\) formulation the generator generates synthetic data from noise—in our notation, we express this as \({\mathcal {S}}_{\text {data}} \sim {\textbf{G}}({\mathcal {Z}})\). The discriminator model \({\textbf{D}}\) is trained to discriminate between \({\mathcal {D}}_{\text {data}}\) and \({\mathcal {S}}_{\text {data}}\). To do this, an auxiliary label \(Y_{d} = 1\) or \(Y_{d} = 0\) is appended to \({\mathcal {D}}_{\text {data}}\) and \({\mathcal {S}}_{\text {data}}\), respectively, specifying whether a sample belongs to the original or the synthetic data. Formally, the objective function of tabular \({{\texttt {GAN}}}\) models leads to solving the following min-max adversarial game, which in our notation is expressed as:

$$\begin{aligned} \begin{aligned} \max _{{\textbf{D}}} \min _{{\textbf{G}}} \quad&{\textbf{E}}_{{\mathcal {D}} \sim \textrm{P}_{\text {data}}(.)}[\log {\textbf{D}}({\mathcal {D}}_{\text {data}})] \\&+ {\textbf{E}}_{{\mathcal {Z}} \sim \textrm{P}_{{\mathcal {Z}}}(.)}[\log (1 - {\textbf{D}}(\underbrace{{\textbf{G}}({\mathcal {Z}})}_{{\mathcal {S}}_{\text {data}}}))].\\ \end{aligned} \end{aligned}$$
(1)

It can be seen that \({\textbf{G}}({\mathcal {Z}})\) generates the synthetic dataset samples \({\mathcal {S}}_{\text {data}}\), and \({\textbf{D}}\) tries to map the synthetic data to a scalar value representing the probability of it being real or not.

In the following, we discuss how to use \({ { \texttt {BN} } }^{e}\) as the generator and \({ { \texttt {BN} } }^{d}\) as the discriminator, leading to our \({ { \texttt {GANBLR} } }\) formulation. A list of all the symbols used in this work is given in Table 1.

3.2 \({ { \texttt {GANBLR} } }\)—components

The generator in our proposed formulation deviates from vanilla \({{\texttt {GAN}}}\) as it has two roles to play:

  • Its first role is to learn the parameters of \({ { \texttt {BN} } }^{e}\). By doing so, it learns the weights by optimizing the discriminative objective function, while fulfilling the probability constraints on the weights. We denote the generator for this training role as \(\tilde{{\textbf{G}}}\). Note, the input to \(\tilde{{\textbf{G}}}\) is the original data \({\mathcal {D}}_{\text {data}}\), i.e. we can write: \(\tilde{{\textbf{G}}}({\mathcal {D}}_{\text {data}})\).

  • The second role of the generator is to sample data after the discriminative Bayesian Network \({ { \texttt {BN} } }^e\) is trained. Since the optimized parameters are conditional probabilities, we can now use \({ { \texttt {BN} } }^e\) in generative mode to sample the synthetic data \({\mathcal {S}}_{\text {data}}\). We denote this sampling role as \(\bar{{\textbf{G}}}\). The input to the generator in this role is null—i.e. we can write: \(\bar{{\textbf{G}}}(.)\). Note, this formulation deviates from existing tabular \({{\texttt {GAN}}}\) models, as our generator does not generate from a random noise distribution.

The two roles of the generator in \({ { \texttt {GANBLR} } }\) work seamlessly within the overall adversarial training framework—first, the generator operates in its training role (\(\tilde{{\textbf{G}}}\)), optimizing its weights discriminatively under constraints, and then shifts to its sampling role (\(\bar{{\textbf{G}}}\)) for synthetic dataset generation. For the sake of simplicity, we use the notation \({\textbf{G}}\) for the generator in cases where its role is clear from the context.

The discriminator \({\textbf{D}}\) in \({ { \texttt {GANBLR} } }\) is again a Bayesian Network—\({ { \texttt {BN} } }^d\) (trained discriminatively, but with no constraints on the weights). It learns to distinguish between \({\mathcal {D}}_{\text {data}}\) and \({\mathcal {S}}_{\text {data}}\). The loss from \({\textbf{D}}\) is back-propagated to the generator \({\textbf{G}}\) to improve the synthetic data generation. Let us delve into the details of each component of \({ { \texttt {GANBLR} } }\).

3.2.1 Generator \({\textbf{G}}\)

Let us establish the form of the generator first. In \({ { \texttt {GANBLR} } }\), we recommend using the restricted Bayesian Network model \({ { \texttt {KDB} } }\). Although any form of Bayesian Network can be used in the \({ { \texttt {GANBLR} } }\) framework, restricted Bayesian Networks have some advantages. First, the structure can be configured easily by specifying a single hyper-parameter, i.e. the number of parents (k); therefore, \({ { \texttt {GANBLR} } }\) only needs to focus on parameter learning given the structure. Secondly, since a \({ { \texttt {BN} } }\) with immoral nodesFootnote 6 can lead to non-convex optimization problems [22], using a restricted Bayesian Network decreases the chances of obtaining immoral nodes. For example, with \(k < 2\), we do not have the problem of immoral nodes at all; however, with \(k \ge 2\), discriminative training of a \({ { \texttt {BN} } }\) can lead to non-convex optimization.Footnote 7

The \({ { \texttt {KDB} } }\) model uses mutual and conditional mutual information to learn the structure. A typical feature interaction in a \({ { \texttt {KDB} } }\) model includes the feature itself, the target feature, and the feature's parent(s). As discussed earlier, we wish to train \({ { \texttt {KDB} } }\) discriminatively with certain constraints fulfilled—leading to the \({ { \texttt {KDB} } }^e\) formulation. However, for the sake of generality, we will use the term \({ { \texttt {BN} } }^e\) instead.

The generator in \({ { \texttt {GANBLR} } }\) optimizes the following two objective functions:

  • Maximizing \(\tilde{{\textbf{G}}}({\mathcal {D}}_{\text {data}})\) which is the conditional log-likelihood of the form: \(\log (\textrm{P}(Y_{g}|X_{g}^{k}))\), and

  • Minimizing the loss \(\log (1-{\textbf{D}}(\bar{{\textbf{G}}}(.)))\) or \(\log (1-{\textbf{D}}({\mathcal {S}}_{\text {data}}))\).

Instead of minimizing \(\log (1 - {\textbf{D}}({\mathcal {S}}_{\text {data}}))\), we can maximize \(- \log (1 - {\textbf{D}}({\mathcal {S}}_{\text {data}}))\), which leads to the following objective function for the generator \({\textbf{G}}\):

$$\begin{aligned} \begin{aligned} \max _{\pmb {\theta _g} \in {\textbf{G}}} \quad&\underbrace{\log (\textrm{P}(Y_{g}|X_{g}^{k}))}_{\tilde{{\textbf{G}}}({\mathcal {D}}_{\text {data}})} - \log (1 - {\textbf{D}}(\underbrace{{\mathcal {S}}_{\text {data}}}_{\bar{{\textbf{G}}}(.)})). \end{aligned} \end{aligned}$$
(2)

Note, just like vanilla \({{\texttt {GAN}}}\) models, \(- \log (1-{\textbf{D}}({\mathcal {S}}_{\text {data}}))\) part of the objective function is not involved while training the parameters of \({\textbf{G}}\), as the discriminator \({\textbf{D}}\) is fixed during \({\textbf{G}}\)’s optimization.

Let us focus on the generator parameters (\(\pmb {\theta _g}\)) that are to be learned in Eq. 2. For this, we write \(\log (\textrm{P}(Y_{g}|X_{g}^{k}))\) as:

$$\begin{aligned} \begin{aligned} \log \textrm{P}(Y_{g}|X_{g}^{k})&= \log (\theta _{y}) + \sum _{i=1}^{n} \log (\theta _{x_{i}|y,\pi _{x_{i}}}) - \log \left( \sum _{y'} \theta _{y'}\prod _{i=1}^{n} \theta _{x_{i}|y',\pi _{x_{i}}} \right) . \end{aligned} \end{aligned}$$
(3)

Here, \(\theta _{y}\) denotes the weight associated with the class (which can be considered the class prior or the intercept term), \(x_i\) denotes the value of the i-th feature, and \(\pi _{x_{i}}\) denotes the set of feature values of feature i's parents. Note, y denotes the class value, and the class is a parent of every feature. Since the Bayesian Network structure leads to conditional probabilities, our notation is symbolic in that we represent a weight in the network as \(\theta _{x_{i}|y,\pi _{x_{i}}}\). In practice, \({ { \texttt {GANBLR} } }\) has a parameter \(\theta \) associated with each interaction \(x_{i}, y, \pi _{x_{i}}\). Note, \(\log \sum _{y'} (\theta _{y'}\prod _{i=1}^{n} \theta _{x_{i}|y',\pi _{x_{i}}} )\) is the normalization term ensuring that \(\textrm{P}(Y_g|X_g^k)\) lies between 0 and 1.

Notably, \({ { \texttt {GANBLR} } }\) enforces the constraints on \(\pmb {\theta _g}\), making sure that:

$$\begin{aligned} \sum _{j = 1}^{{\mathcal {X}}_i} \theta _{x_{j} | y,\pi _{x_{i}}} = 1, \end{aligned}$$
(4)

where \({\mathcal {X}}_i\) represents the cardinality of feature i, and \(x_j\) represents the j-th feature value. This constraint is enforced through the following softmax transformation of the unconstrained weights:

$$\begin{aligned} \theta _{x_{i}|y,\pi _{x_{i}}} = \frac{\exp {(\theta _{x_{i}|y,\pi _{x_{i}}}})}{ \sum _{j = 1}^{{\mathcal {X}}_i} \exp {(\theta _{x_{j} | y,\pi _{x_{i}}} })}. \end{aligned}$$
(5)
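The following TensorFlow sketch (our reading, simplified to \(k=0\), i.e. a naive Bayes structure) illustrates how the constraints of Eqs. 4 and 5 can be enforced in practice: unconstrained logits are mapped to probabilities with a softmax, and the conditional log-likelihood of Eq. 3 is maximized with gradient ascent. The variable names and the toy data are illustrative assumptions, not the authors' code.

```python
import tensorflow as tf

n_classes, cardinalities = 2, [3, 4]          # toy problem: 2 classes, 2 features
logit_y = tf.Variable(tf.zeros([n_classes]))  # unconstrained class weights
logit_x = [tf.Variable(tf.zeros([n_classes, c])) for c in cardinalities]

def conditional_log_likelihood(X, y):
    # Scores log(theta_y') + sum_i log(theta_{x_i | y'}) for every class y'.
    scores = tf.nn.log_softmax(logit_y)[None, :]          # class prior (Eq. 5)
    for i, logits in enumerate(logit_x):
        log_theta = tf.nn.log_softmax(logits, axis=1)     # each row sums to 1 (Eq. 4)
        scores = scores + tf.gather(tf.transpose(log_theta), X[:, i])
    # Subtract the log-normalizer of Eq. 3 and pick out the true class.
    log_post = scores - tf.reduce_logsumexp(scores, axis=1, keepdims=True)
    return tf.reduce_mean(tf.reduce_sum(log_post * tf.one_hot(y, n_classes), axis=1))

X = tf.constant([[0, 1], [2, 3], [1, 0]])     # toy discretized features
y = tf.constant([0, 1, 0])                    # toy class labels
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = -conditional_log_likelihood(X, y)          # maximize the CLL
    grads = tape.gradient(loss, [logit_y, *logit_x])
    opt.apply_gradients(zip(grads, [logit_y, *logit_x]))

print("constrained class prior:", tf.nn.softmax(logit_y).numpy())  # sums to 1
```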

Once the \({ { \texttt {BN} } }^{e}\) weights are trained, the second role of the generator, \(\bar{{\textbf{G}}}\)—that is, generating the data—begins. One can generate the synthetic data \({\mathcal {S}}_{\text {data}}\) using forward sampling [23]. The size m of the synthetic dataset can be set in forward sampling, whereas the feature interaction order k is expected to be the same as that of the generator \({\textbf{G}}\)'s input, i.e. \({\mathcal {D}}_{\text {data}}\). The sampling process of the generator \({\textbf{G}}\) is straightforward, and can be expressed as:

$$\begin{aligned} {\mathcal {S}}_{\text {data}} = \bar{{\textbf{G}}}(.). \end{aligned}$$
(6)
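As an illustration of this sampling role, the following minimal sketch (assumed, with \(k=0\) and placeholder probability tables) performs forward sampling: the class is drawn from \(\theta _y\), and each feature is then drawn from \(\theta _{x_i|y}\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta_y = np.array([0.6, 0.4])                   # P(y): placeholder class prior
theta_x = [                                      # P(x_i | y): placeholder tables, rows indexed by y
    np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]]),
    np.array([[0.5, 0.5],
              [0.9, 0.1]]),
]

def forward_sample(m):
    """Ancestral sampling: draw y first, then each feature given y."""
    rows = []
    for _ in range(m):
        y = rng.choice(len(theta_y), p=theta_y)
        x = [rng.choice(t.shape[1], p=t[y]) for t in theta_x]
        rows.append(x + [y])
    return np.array(rows)

print(forward_sample(5))                         # columns: x_1, x_2, y
```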
Fig. 1: \({ { \texttt {GANBLR} } }\) Framework

3.2.2 Discriminator \({\textbf{D}}\)

The discriminator \({\textbf{D}}\) determines the quality of the synthetic data \({\mathcal {S}}_{\text {data}}\) during training, and back-propagates the resulting error to the generator \({\textbf{G}}\). Note, as discussed, the generator \({\textbf{G}}\) in \({ { \texttt {GANBLR} } }\) uses the loss from the discriminator \({\textbf{D}}\) to adjust its weights \(\pmb {\theta _g}\). The input for the discriminator \({\textbf{D}}\) is formed from the real data \([{\mathcal {D}}_{\text {data}},Y_{d}=1]\) and the synthetic data \([{\mathcal {S}}_{\text {data}},Y_{d}=0]\). In \({ { \texttt {GANBLR} } }\), the discriminator \({\textbf{D}}\) is again a Bayesian Network model (\({ { \texttt {BN} } }^{d}\)) trained to optimize the CLL, with the same hyper-parameter k as the generator's Bayesian Network (\({ { \texttt {BN} } }^{e}\)). The training of the discriminator \({\textbf{D}}\) aims to maximize:

  • \(\textrm{P}(Y_{d}=1|{\mathcal {D}}_{\text {data}}) = {\textbf{D}}({\mathcal {D}}_{\text {data}})\), and

  • \(\textrm{P}(Y_{d}=0|{\mathcal {S}}_{\text {data}}) = 1 - {\textbf{D}}({\mathcal {S}}_{\text {data}})\),

by optimizing the following objective function:

$$\begin{aligned} \begin{aligned} \max _{\pmb {\theta _{d}} \in {\textbf{D}}} \quad&\log {\textbf{D}}({\mathcal {D}}_{\text {data}}) + \log (1-{\textbf{D}}({\mathcal {S}}_{\text {data}})), \end{aligned} \end{aligned}$$
(7)

where \(\pmb {\theta _{d}}\) are the parameters of \({ { \texttt {BN} } }^d\), whereas \(\log \left( 1 - {\textbf{D}}({\mathcal {S}}_{\text {data}}) \right) \) is passed on to the generator \({\textbf{G}}\), as in standard vanilla \({{\texttt {GAN}}}\) formulation.
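A minimal sketch of this discriminator step is given below (our simplification: with \(k=0\), \({ { \texttt {BN} } }^d\) reduces to a logistic regression over one-hot features). Real rows receive the auxiliary label \(Y_d = 1\) and synthetic rows \(Y_d = 0\); the helper name train_discriminator and the toy data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def train_discriminator(real_X, synth_X):
    """Fit BN^d (here: logistic regression, the k = 0 case) on real-vs-synthetic labels."""
    X = np.vstack([real_X, synth_X])
    y_d = np.concatenate([np.ones(len(real_X)), np.zeros(len(synth_X))])
    enc = OneHotEncoder(handle_unknown="ignore").fit(X)
    disc = LogisticRegression(max_iter=1000).fit(enc.transform(X), y_d)
    # P(Y_d = 1 | x) for the synthetic rows; this is the signal fed back to G.
    return disc.predict_proba(enc.transform(synth_X))[:, 1]

rng = np.random.default_rng(0)
real_X = rng.integers(0, 3, size=(200, 4))       # toy categorical data
synth_X = rng.integers(0, 3, size=(200, 4))
print(train_discriminator(real_X, synth_X)[:5])
```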

3.3 \({ { \texttt {GANBLR} } }\)—Algorithm

Training \({ { \texttt {GANBLR} } }\) requires training the generator \({\textbf{G}}\)'s and the discriminator \({\textbf{D}}\)'s parameters in turns. We combine Eqs. 2 and 7 to obtain the objective function of our \({ { \texttt {GANBLR} } }\) formulation:

$$\begin{aligned} \begin{aligned} \max _{\pmb {\theta _{d}}} \min _{\pmb {\theta _{g}}} \quad&\log {\textbf{D}}({\mathcal {D}}_{\text {data}}) + \log (1 - {\textbf{D}}({\mathcal {S}}_{\text {data}}))\\&- \log (\textrm{P}(Y_{g}|X_{g}^{k})). \end{aligned} \end{aligned}$$
(8)

The complete algorithm of \({ { \texttt {GANBLR} } }\) is provided in Algorithm 1, and its architecture is shown in Fig. 1. In each of a total of Q iterations (epochs), the input to \({ { \texttt {GANBLR} } }\) is used to train \(\tilde{{\textbf{G}}}\) while the discriminator \({\textbf{D}}\) is kept fixed; afterwards, the discriminator \({\textbf{D}}\) is trained to discriminate between the synthetic and the real datasets.

[Algorithm 1]

3.4 \({ { \texttt {GANBLR} } }\)—no adversarial learning

It can be seen from Algorithm 1 that \({ { \texttt {GANBLR} } }\) can still fulfil its goal of synthesizing data without the adversarial learning component (i.e. \({\textbf{D}}\)). In practice, we can remove \({\textbf{D}}\) from \({ { \texttt {GANBLR} } }\)—leading to a variant configuration that we call \({ { \texttt {GANBLR-nAL} } }\). However, we argue that having an adversarial learning component leads to a much better data generation model, as we will show in Sect. 4.5—where we compare the performance of \({ { \texttt {GANBLR} } }\) with that of \({ { \texttt {GANBLR-nAL} } }\). The \({ { \texttt {GANBLR-nAL} } }\) algorithm is provided in Algorithm 2:

[Algorithm 2]

3.5 \({ { \texttt {GANBLR} } }\)—summary

Let us briefly discuss two salient features of \({ { \texttt {GANBLR} } }\). From Algorithm 1, it can be seen that \({ { \texttt {GANBLR} } }\) generates the synthetic dataset \({\mathcal {S}}_{\text {data}}\) using Eq. 6. As mentioned in Sect. 2, a desirable property of tabular generation models is the ability to generate synthetic data with a particular feature value (e.g. to address the imbalanced-dataset limitation). The \({ { \texttt {GANBLR} } }\) generator can simply sample a synthetic dataset with a specified feature value using rejection sampling [24], which can be effective for imbalanced data generation (a minimal sketch is given below). Additionally, the learned parameters in \({ { \texttt {GANBLR} } }\) are actual probabilities, which can be used to interpret the generation process.
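A minimal sketch of this rejection-sampling idea follows (an assumption for illustration; sample_row stands in for one draw from the trained \({ { \texttt {BN} } }^e\) generator): rows are repeatedly sampled and kept only if the class value matches the requested one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_row():
    """Stand-in for a single forward-sampled row from the trained BN^e generator."""
    y = rng.choice(2, p=[0.9, 0.1])              # a heavily imbalanced class
    x = rng.integers(0, 3, size=3)
    return list(x) + [int(y)]

def rejection_sample(target_class, m):
    """Keep only sampled rows whose class equals target_class."""
    rows = []
    while len(rows) < m:
        row = sample_row()
        if row[-1] == target_class:
            rows.append(row)
    return np.array(rows)

print(rejection_sample(target_class=1, m=5))     # minority-class rows only
```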

4 Experiment and analysis

Let us discuss the effectiveness of \({ { \texttt {GANBLR} } }\) in synthetic data generation in this section. We consider 15 commonly used datasets to compare the performance of \({ { \texttt {GANBLR} } }\) with three SOTA \({{\texttt {GAN}}}\) models for tabular data generation. Specifically, we evaluate the effectiveness of \({ { \texttt {GANBLR} } }\) in terms of:

  • Machine learning utility—which reflects the quality of the synthetic data.

  • Statistical similarity—which measures the statistical similarity between the synthetic and the real data.

  • Interpretability—which shows the interpretable capability of \({ { \texttt {GANBLR} } }\).

Both machine learning utility and statistical similarity are standard measures for determining the quality of data generation algorithms [9]. Moreover, we perform various ablation studies to investigate:

  • The effect of \({ { \texttt {GANBLR} } }\)’s hyper-parameter k, and

  • The effectiveness of the adversarial component of \({ { \texttt {GANBLR} } }\), by comparing \({ { \texttt {GANBLR} } }\) with \({ { \texttt {GANBLR-nAL} } }\).

We also study the efficacy of \({ { \texttt {GANBLR} } }\) by comparing its performance on two synthetic datasets. The best results are highlighted in bold font in our experiments.

Table 2 Description of datasets

4.1 Experiment setup

4.1.1 Datasets

We use 15 commonly used classification datasets and 2 synthetic datasets. Of the 15 datasets, 13 are from the UCI dataset repository and 2 are from Kaggle—namely, Credit and Loan. The two synthetic datasets are generated based on the Poker-hand dataset. All these datasets have a specific dependent variable and a set of independent features. Among them, 5 are large datasets with more than 50K instances (denoted Large), 5 are medium datasets with between 15K and 50K instances (denoted Medium), and the remaining 5 have fewer than 15K instances (denoted Small). The datasets are summarized in Table 2.

4.1.2 Baselines and evaluation metric

We compare \({ { \texttt {GANBLR} } }\) with \({ { \texttt {CTGAN} } }\), \({ { \texttt {TableGAN} } }\) and \({ { \texttt {MedGAN} } }\). All baseline methods are trained for 150 epochs on the 5 Large datasets, and for 100 epochs on the Medium and Small datasets. Each experiment is repeated 3 times with 2-fold cross-validation, and averaged results are reported. As can be seen from Table 2, most datasets have more than 2 classes, and hence we report accuracy (instead of the widely used AUC measure).

4.1.3 Configuration and running environment

The parameter k in the experiments is set to 0—that is, the Bayesian Network in the generator is naive Bayes and the discriminator is Logistic Regression. In Sect. 4.5, we study the effect of varying the value of k.

\({ { \texttt {GANBLR} } }\) is coded in Python 3.7 using the TensorFlow 2.5 framework, and the experiments are run on an 8-core Intel i8 CPU machine with 32 GB of RAM.

4.1.4 Machine learning utility

Machine learning utility refers to the accuracy obtained from a machine learning model [9]. In the common scenario, both the training data and the test data are real; to evaluate data generator models, however, the training data is synthetic while the test data is real. More precisely, we use the following two protocols to assess machine learning utility:

  • \({ { \texttt {TSTR} } }\)—Training on Synthetic data, Testing on Real data, Accuracy is reported.

  • \({ { \texttt {TRTR} } }\)—Training on Real data, and Testing on Real data, Accuracy is reported.

To obtain the \({ { \texttt {TSTR} } }\) and \({ { \texttt {TRTR} } }\) performance of \({ { \texttt {GANBLR} } }\) and competing baseline methods, we will use four commonly used machine learning classification algorithms (Sect. 4.1.2):

  • Logistic Regression (\({ { \texttt {LR} } }\)),

  • Multi-layer-Perceptron (\({ { \texttt {MLP} } }\)),

  • Random Forest (\({ { \texttt {RF} } }\)), and

  • Extreme Gradient Boosting Tree (\({ { \texttt {XGBT} } }\)).

Note, \({ { \texttt {TSTR} } }\) is used to assess the quality of the synthetic data generated by \({ { \texttt {GANBLR} } }\) and the baseline methods, whereas \({ { \texttt {TRTR} } }\) is only used to indicate the ideal machine learning utility.

Fig. 2: Machine learning utility evaluation process

Figure 2 illustrates the evaluation process for machine learning utility. For \({ { \texttt {TSTR} } }\), we first split the real datasets into real training and real test sets; the real training sets are used as input for training \({ { \texttt {GANBLR} } }\) and the baseline methods. Once training is completed, the synthetic datasets are generated. The synthetic training datasets are then used to train the above-mentioned machine learning classification algorithms, which are evaluated on the real test sets. The \({ { \texttt {TSTR} } }\) result not only reflects the realistic machine learning utility of all compared methods, but also answers the question: "Can synthetic data be used as a substitute for real data without significantly impacting the performance of machine learning tasks?". Ideally, the higher the \({ { \texttt {TSTR} } }\) accuracy (i.e. the higher the machine learning utility), the better the data generation algorithm. In contrast, \({ { \texttt {TRTR} } }\) trains the machine learning classification algorithms on the real training sets and evaluates them on the real test sets; it is included in the comparison to highlight the ideal machine learning utility.

Note, we are interested in data generation methods which have higher values of TSTR.
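The following sketch (our illustration, not the authors' evaluation code) shows how the TSTR protocol can be realized with scikit-learn; XGBT is omitted to keep the example dependency-free, and the random arrays merely stand in for synthetic and real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def tstr(synthetic_X, synthetic_y, real_X_test, real_y_test):
    """Train on synthetic data, test on held-out real data, report accuracy."""
    models = {
        "LR": LogisticRegression(max_iter=1000),
        "MLP": MLPClassifier(max_iter=500),
        "RF": RandomForestClassifier(),
    }
    return {name: accuracy_score(real_y_test,
                                 model.fit(synthetic_X, synthetic_y).predict(real_X_test))
            for name, model in models.items()}

# Toy usage with random stand-ins for the synthetic and real datasets.
rng = np.random.default_rng(0)
synth_X, synth_y = rng.integers(0, 3, (300, 4)), rng.integers(0, 2, 300)
real_X, real_y = rng.integers(0, 3, (300, 4)), rng.integers(0, 2, 300)
_, real_X_te, _, real_y_te = train_test_split(real_X, real_y, random_state=0)
print(tstr(synth_X, synth_y, real_X_te, real_y_te))
```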

4.1.5 Statistical similarity

Two metrics are used to quantitatively measure the statistical similarity between the real datasets and the synthetic datasets generated by \({ { \texttt {GANBLR} } }\) and its baseline methods:

  • Jensen–Shannon Divergence (JSD). The JSD quantifies the difference between the probability mass distribution of individual categorical feature in the real data and the synthetic dataset, and it is bounded between 0 and 1.

  • Wasserstein Distance (WD). Similarly, WD captures the earth moving distance on features between the real and synthetic dataset.

Note, we use distance as a proxy for similarity; therefore, the lower the distance, the higher the similarity. Of course, we are after data generation methods that lead to lower values of JSD and WD.
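The two metrics can be computed per feature and then averaged, as in the following sketch (assumed, for integer-coded categorical features; note that scipy's jensenshannon returns the JS distance, so it is squared to obtain the divergence).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def per_feature_similarity(real_X, synth_X):
    """Average JSD and WD over all (categorical, integer-coded) features."""
    jsds, wds = [], []
    for i in range(real_X.shape[1]):
        cats = np.union1d(real_X[:, i], synth_X[:, i])
        p = np.array([(real_X[:, i] == c).mean() for c in cats])   # real marginal
        q = np.array([(synth_X[:, i] == c).mean() for c in cats])  # synthetic marginal
        jsds.append(jensenshannon(p, q, base=2) ** 2)              # squared distance = divergence
        wds.append(wasserstein_distance(real_X[:, i], synth_X[:, i]))
    return float(np.mean(jsds)), float(np.mean(wds))

rng = np.random.default_rng(0)
real_X = rng.integers(0, 5, size=(1000, 3))
synth_X = rng.integers(0, 5, size=(1000, 3))
print(per_feature_similarity(real_X, synth_X))   # (avg JSD, avg WD)
```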

4.2 Results analysis on synthetic datasets

In this subsection, two synthetic datasets are used to evaluate the effectiveness of \({ { \texttt {GANBLR} } }\) in modelling high-order feature interactions within the data. A higher accuracy on these datasets indicates that a model has a superior capability to capture higher-order interactions. The two synthesized datasets are based on the Poker-hand dataset, in which each instance represents a poker hand consisting of five cards; there are a total of 4 suits, each with 13 ranks. It can be seen that a model must be able to handle order-5 interactions to learn to distinguish a particular type of hand.

The first synthesized dataset, labelled Synthetic1, is a four-hand version of the original Poker-hand dataset: each hand has a total of 4 cards, with the same 4 suits and 13 ranks as the normal Poker-hand dataset. Of course, the four-hand Poker-hand requires an order-4 model to successfully capture the interactions needed to distinguish each hand.

The second synthetic dataset, Synthetic2, is a six-hand version of the original Poker-hand dataset: each hand has a total of 6 cards, with the same number of suits and ranks as the normal Poker-hand dataset. Again, the six-hand version of Poker-hand requires an order-6 model to capture the interactions needed to distinguish each hand.

To synthesize these two datasets, we designed the following procedure:

  • First, decide the version of the synthetic Poker-hand, i.e. either Synthetic1 or Synthetic2.

  • Identify the rules of each class from Poker-hand. For example, a Full house is not possible in the four-hand synthetic Poker-hand.

  • Uniformly sample the cards for each respective hand while enforcing the rules of each class.

4.2.1 Synthetic1: Four-hand Poker-hand

In this part of the experiment, we compare the performance of \({ { \texttt {GANBLR} } }\) with the current SOTA \({ { \texttt {CTGAN} } }\). The purpose of this experiment is to evaluate how well \({ { \texttt {GANBLR} } }\) performs under different levels of high-order feature interaction. In the results, k=0 denotes a \({ { \texttt {GANBLR} } }\) model in which the generator \({\textbf{G}}\) is linear, k=1 means that the \({ { \texttt {BN} } }^{e}\) in the generator is configured to capture order-1 interactions, and k=2 means that it is configured to capture order-2 interactions. Note, for a systematic comparison, we keep the feature interaction level of the \({ { \texttt {BN} } }^{d}\) in the discriminator \({\textbf{D}}\) the same as that of the generator \({\textbf{G}}\)'s \({ { \texttt {BN} } }^{e}\). For \({ { \texttt {CTGAN} } }\), we use the default settings.

Table 3 shows the machine learning utility (TSTR) of \({ { \texttt {GANBLR} } }\) with different k values. It can be seen that \({ { \texttt {GANBLR} } }\) achieves better accuracy than \({ { \texttt {CTGAN} } }\) on the Synthetic1 dataset, reaching an accuracy of \(85.61\%\) with XGBT at k=2 and \(83.20\%\) with RF at k=2. The results suggest that the \({ { \texttt {BN} } }^{e}\) in \({ { \texttt {GANBLR} } }\) has a strong capability to capture higher-order feature interactions in the Synthetic1 dataset. Note, the advantage of \({ { \texttt {GANBLR} } }\) over \({ { \texttt {CTGAN} } }\) grows with larger values of k.

Table 3 Synthetic1 dataset machine learning utility performance (TSTR) with different k

Table 4 shows the statistical similarity performance of \({ { \texttt {GANBLR} } }\) with varying values of k. It can be seen that \({ { \texttt {GANBLR} } }\) has better statistical similarity than \({ { \texttt {CTGAN} } }\). The results also reveal that, as the feature interaction level (k) increases, the capability of \({ { \texttt {BN} } }^{e}\) to capture higher-order interactions enables \({ { \texttt {GANBLR} } }\) to produce a better quality dataset.

Table 4 Synthetic1 dataset statistical similarity performance with different k (smaller values indicate better results)

4.2.2 Synthetic2: Six-hand Poker-hand

Let us now compare the performance of \({ { \texttt {GANBLR} } }\) with \({ { \texttt {CTGAN} } }\) on the six-hand Poker-hand dataset. Again, we evaluate performance in terms of machine learning utility and statistical similarity. Table 5 shows the machine learning utility of \({ { \texttt {GANBLR} } }\) for different values of k. It can be seen that \({ { \texttt {GANBLR} } }\) always performs better than \({ { \texttt {CTGAN} } }\) in terms of machine learning utility. In particular, when the value of k is increased from 0 to 1, the performance of \({ { \texttt {GANBLR} } }\) with XGBT is substantially boosted from an accuracy of \(53.11\%\) to \(72.39\%\). This result is extremely encouraging and indicates that \({ { \texttt {GANBLR} } }\) can effectively leverage the higher-order feature interactions in the six-hand Poker-hand dataset.

Table 5 Synthetic2 dataset machine learning utility performance (TSTR) with different k

The statistical similarity on the six-hand Poker-hand dataset is evaluated to assess the quality of the generated synthetic dataset. Table 6 presents the statistical similarity performance of \({ { \texttt {GANBLR} } }\) with different k values. As with the statistical similarity results on the four-hand Poker-hand dataset, it can be seen that \({ { \texttt {GANBLR} } }\) outperforms \({ { \texttt {CTGAN} } }\) in terms of both the JSD and WD distances, and the best results, as expected, are achieved with higher values of k.

Table 6 Synthetic2 dataset statistical similarity performance with different k (smaller values indicate better results)

4.3 Results analysis on commonly used datasets

Table 7 Average Machine Learning Utility (TSTR and TRTR) comparison of \({ { \texttt {GANBLR} } }\) and competing baseline models on different-sized datasets
Fig. 3: Machine Learning Utility (TSTR) comparison of \({ { \texttt {GANBLR} } }\) and competing baseline models on different-sized datasets in terms of box plots (from left to right: GANBLR, CTGAN, TABLEGAN, MEDGAN, TRTR)

4.3.1 Results of machine learning utility

Table 7 provides the averaged machine learning utility results, in terms of TSTR and TRTR, on the Large, Medium and Small datasets. It is clear that \({ { \texttt {GANBLR} } }\) outperforms all other baseline methods in terms of \({ { \texttt {TSTR} } }\). Particularly on the Small and Medium datasets, \({ { \texttt {GANBLR} } }\) performs significantly better than the other baseline methods. It is interesting to see that \({ { \texttt {GANBLR} } }\)'s \({ { \texttt {TSTR} } }\) performance is close to the \({ { \texttt {TRTR} } }\) performance, while none of the other baseline methods come close to \({ { \texttt {TRTR} } }\). Similar findings can be drawn from Table 9, which provides the detailed \({ { \texttt {TSTR} } }\) performance for all datasets. These results are extremely encouraging, as they demonstrate that the synthetic data generated by \({ { \texttt {GANBLR} } }\) is far more useful for machine learning tasks than that of any other existing SOTA data generation algorithm.

While Table 7 provides averaged results, let us look at the distribution of accuracies for the different datasets. In Fig. 3, we show box plots of the \({ { \texttt {TSTR} } }\) accuracy of \({ { \texttt {GANBLR} } }\) and its baseline methods for the 4 machine learning algorithms. Again, we break the results down into Large, Medium and Small datasets, and plot \({ { \texttt {TRTR} } }\) for the sake of comparison as well. It can be seen that, regardless of the dataset size, \({ { \texttt {GANBLR} } }\) significantly outperforms all the baseline methods. In particular, for Small and Medium datasets, the performance of \({ { \texttt {GANBLR} } }\) is extremely impressive, as its box plots (red) closely match those of \({ { \texttt {TRTR} } }\) (orange).

Table 8 Comparison of Statistical Similarity measure of \({ { \texttt {GANBLR} } }\) with competing baseline models
Table 9 Machine Learning Utility on all datasets in terms of TSTR
Table 10 Significance test between \({ { \texttt {GANBLR} } }\) and competing baseline methods

4.3.2 Statistical similarity

To obtain the JSD results for a dataset, each feature in the synthetic dataset is measured against the same feature in the real dataset in terms of JSD. We repeat the process for all features and for all datasets, and report the averaged result in Table 8—which, as discussed, can be seen as a measure of the statistical similarity between the synthetic and original datasets.

It can be seen from Table 8 that \({ { \texttt {GANBLR} } }\) stands out when compared to the other baseline methods: where it is not the best, it is always the second best. In particular, on the Small datasets, \({ { \texttt {GANBLR} } }\) has smaller \({ { \texttt {JSD} } }\) and \({ { \texttt {WD} } }\) values than all the competing baselines, highlighting that it produces datasets of superior quality. On the Medium datasets, \({ { \texttt {GANBLR} } }\)'s JSD performance is similar to \({ { \texttt {CTGAN} } }\)'s and is the second best, while its WD performance is similar to \({ { \texttt {TableGAN} } }\)'s and is again the second best. On the Large datasets, \({ { \texttt {GANBLR} } }\) has the best performance in terms of \({ { \texttt {WD} } }\), though it marginally loses to \({ { \texttt {CTGAN} } }\) in terms of the \({ { \texttt {JSD} } }\) distance.

Delving into why \({ { \texttt {GANBLR} } }\) has sub-optimal results on the Medium datasets—we believe this could be due to the ability of \({ { \texttt {GANBLR} } }\)'s generator, a Bayesian Network, to generate feature values that are rarely seen in the real dataset. Clearly, statistical similarity evaluation based on JSD and WD does not credit the generation of data that is absent from the real data but is nonetheless useful for the classification task.

We conjecture that another reason could be the discriminative training of \({ { \texttt {BN} } }^{e}\), which aims to produce features that enhance the discriminative power of a \({ { \texttt {BN} } }\), and can therefore produce slightly different datasets (Table 9). We further ran a t-test to assess the significance of the similarity results in Table 8. It can be seen from Table 10 that the differences in statistical similarity between \({ { \texttt {GANBLR} } }\) and all other baselines are significant.

4.4 Interpretation analysis

Let us study the interpretability of our proposed model. We believe that a good interpretation in any tabular data generator should:

  • Provide local interpretability, with a clear understanding of why a synthetic data point is assigned its generated synthetic label at any time during training. For example, given a generated synthetic data instance, the probability of each possible synthetic label should be provided.

  • Provide global interpretability of how the features generally impact the synthetic label generation. For example, which feature has the largest impact on the synthetic label.

4.4.1 Local interpretability

As discussed in Sect. 3, \({ { \texttt {BN} } }^{e}\) is a discriminatively trained Bayesian Network. Therefore, the learned parameters (\(\pmb \theta \)) are actual conditional probabilities of the form \(\textrm{P}(\textbf{x}|y, \Pi (\textbf{x}))\), where \(\Pi (\textbf{x})\) denotes the function that returns the parents of each feature of \(\textbf{x}\). A convenient property of probabilities is that they are highly interpretable, and having access to these parameters during discriminative training gives \({ { \texttt {GANBLR} } }\) the capability for interpretable learning. For example, during training, one can interpret the importance of features in determining the value of the class (or predicting the class) based on the posterior probability \(\textrm{P}(y|\textbf{x},\Pi (\textbf{x}))\). Specifically, one can use the following formula to determine the importance of a feature or feature-set for class y:

$$\begin{aligned} {\mathcal {I}}_{\textbf{x}, \Pi (\textbf{x})}^{y} \propto \textrm{P}(y|\textbf{x},\Pi (\textbf{x})). \end{aligned}$$
(9)

It can be seen that \({\mathcal {I}}_{\textbf{x}, \Pi (\textbf{x})}^{y}\) is proportional to the conditional probability \(\textrm{P}(y|\textbf{x},\Pi (\textbf{x}))\), which can be obtained by performing inference on \({ { \texttt {BN} } }^{e}\)—the Bayesian Network actually learned by the generator \({\textbf{G}}\). Note, when \(k=0\), the \({ { \texttt {BN} } }^{e}\) in the generator \({\textbf{G}}\) is naive Bayes and the features have no interactions, so \(\Pi (\textbf{x}) = \Phi \), i.e. it is an empty set; in this case, we have \({\mathcal {I}}_{\textbf{x}}^{y} \propto \textrm{P}(y|\textbf{x})\).
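The following minimal sketch (our illustration, \(k=0\), with placeholder probability tables) computes the importance score of Eq. 9 as the class posterior obtained directly from the generator's learned conditional probability tables.

```python
import numpy as np

theta_y = np.array([0.5, 0.3, 0.2])                  # P(y): placeholder class prior
theta_x = [                                          # P(x_i | y): placeholder tables, rows indexed by y
    np.array([[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]),  # e.g. Safety
    np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]),  # e.g. Persons
]

def posterior(x):
    """Importance scores of Eq. 9: P(y | x) for every class y, via Bayes' rule."""
    log_p = np.log(theta_y).copy()
    for i, value in enumerate(x):
        log_p += np.log(theta_x[i][:, value])        # add log P(x_i | y) for each class
    log_p -= np.logaddexp.reduce(log_p)              # normalize over classes
    return np.exp(log_p)

instance = [0, 1]                                    # e.g. Safety = 0, Persons = 1
print(posterior(instance))                           # one importance score per class
```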

In order to investigate the interpretation capability of \({ { \texttt {GANBLR} } }\), we used the CAR dataset for multi-class classification with \({ { \texttt {GANBLR} } }\) at \(k=0\) and \(k=1\). Table 11 shows the feature-set importance for 3 instances with \(k=0\) and 2 instances with \(k=1\); these 5 instances were randomly selected from the synthetic CAR dataset to demonstrate local interpretability. To assess the credibility of \({ { \texttt {GANBLR} } }\)'s local interpretability, we employed the popular method LIME, which explains why a particular instance is assigned to a certain class. Notably, when \(k=0\), the features of an instance have no interactions, and therefore \(\Pi (\textbf{x})\) is empty (see Footnote 8). For Instance 1 (Table 11), the score of each feature and class is listed in the table; the true class is y = 0, which is also established by \({ { \texttt {GANBLR} } }\), as the probability of class y = 0 is the highest (shown in bold). Moreover, for Instance 1, the features Safety = 0 and Persons = 1 contribute the most to this decision (probabilities shown in bold). The LIME results for Instance 1 in Fig. 4 agree, showing that Safety = 0 and Persons = 1 are the top two contributors. Similarly, for Instances 2 and 3, the top contributors from Table 11 match the LIME results in Fig. 4, which demonstrates that \({ { \texttt {GANBLR} } }\) can produce explanations similar to those of LIME. Note that for Instances 2 and 3, the true classes are 3 and 2, respectively, which are also established by \({ { \texttt {GANBLR} } }\), as can be seen from the bold probabilities in the table.

Instances 4 and 5 in Table 11 depict the case with \(k=1\). Maint = 2, Buying = 3 represents an \((\textbf{x}, \Pi (\textbf{x}))\) pair, i.e. \(\textbf{x}= \texttt {Maint}\) and \(\Pi (\textbf{x}) = \texttt {Buying}\). Note that for \(\textbf{x}= \texttt {Doors}\), \(\Pi (\textbf{x}) = \Phi \). For Instance 4, \({ { \texttt {GANBLR} } }\) makes the decision y = 0, and the features Persons = 1 and Safety = 0 contribute the most to this decision. Interestingly, the same finding is observed in the LIME experiment on Instance 4, i.e. the highest probability is observed for y = 0 and the features Persons = 1 and Safety = 0 contribute the most. For Instance 5, \({ { \texttt {GANBLR} } }\)'s decision of y = 3 is driven mainly by Persons = 3 and Safety = 3. The LIME experiment yields the same finding on the feature contributions.

Based on the above comparison of \({ { \texttt {GANBLR} } }\)'s and LIME's interpretability, we found a consistent pattern of agreement between the two, which indicates that the local interpretability of \({ { \texttt {GANBLR} } }\) is highly reliable even during the training phase and is on par with LIME, which can only interpret the model after training.
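For reference, an explanation of the kind shown in Fig. 4 can be produced with the lime package roughly as follows; the random placeholder data, the XGBoost classifier and the feature/class names are stand-ins for the actual experimental setup.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from xgboost import XGBClassifier

# Placeholder integer-coded data standing in for the synthetic CAR dataset.
rng = np.random.default_rng(0)
X_syn = rng.integers(0, 4, size=(500, 6))
y_syn = rng.integers(0, 4, size=500)

feature_names = ["Buying", "Maint", "Doors", "Persons", "Lug_boot", "Safety"]

clf = XGBClassifier().fit(X_syn, y_syn)   # surrogate classifier (XGBT*)

explainer = LimeTabularExplainer(
    X_syn,
    feature_names=feature_names,
    class_names=["0", "1", "2", "3"],
    categorical_features=list(range(X_syn.shape[1])),
    discretize_continuous=False,
)

# Explain one synthetic instance (e.g. Instance 1 of Table 11) for its predicted class.
instance = X_syn[0]
pred = int(clf.predict(instance.reshape(1, -1))[0])
exp = explainer.explain_instance(instance, clf.predict_proba,
                                 labels=[pred], num_features=6)
print(exp.as_list(label=pred))   # (feature condition, weight) pairs, strongest first
```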

4.4.2 Global interpretability

Figure 5 illustrates the global interpretability of \({ { \texttt {GANBLR} } }\) at different training stages, obtained by visualizing the weights learned by the generator in \({ { \texttt {GANBLR} } }\). Darker colours indicate a larger impact on the corresponding class, lighter colours a smaller one. It can be seen that features impact different class labels differently during the training phase, as shown at \(epoch = 1\), \(epoch = 50\) and \(epoch = 100\). Features Persons and Safety have the largest impact on synthetic label 0, features Maint, Persons and Safety have the largest impact on synthetic label 1, and features Persons, Lug_boot and Safety have the largest impact on synthetic label 2. However, feature Safety has far more impact on synthetic label 3 than the other five features; therefore, Safety is the most important factor in deciding whether a car has high value, which is the meaning of class = 3. Again, the purpose of this analysis is to demonstrate \({ { \texttt {GANBLR} } }\)'s global interpretability.
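A heatmap in the style of Fig. 5 can be drawn directly from a class-by-feature impact matrix using matplotlib; the random matrix below is a placeholder for the aggregated generator weights at a given epoch.

```python
import numpy as np
import matplotlib.pyplot as plt

feature_names = ["Buying", "Maint", "Doors", "Persons", "Lug_boot", "Safety"]
class_names = ["class 0", "class 1", "class 2", "class 3"]

# Placeholder impact matrix (rows = classes, columns = features); in practice this
# would be aggregated from the generator's learned weights at a chosen epoch.
rng = np.random.default_rng(0)
impact = rng.random((len(class_names), len(feature_names)))

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(impact, cmap="Greys")          # darker cell = larger impact
ax.set_xticks(range(len(feature_names)))
ax.set_xticklabels(feature_names, rotation=45, ha="right")
ax.set_yticks(range(len(class_names)))
ax.set_yticklabels(class_names)
fig.colorbar(im, ax=ax, label="feature impact")
fig.tight_layout()
plt.show()
```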

Table 11 Interpretation of \({ { \texttt {GANBLR} } }\)’s generator weights at \(epoch = 50\) on 5 instances of car dataset
Fig. 4 LIME explanation for synthetic data after training using XGBT*

Fig. 5 Overall feature impact during the \({ { \texttt {GANBLR} } }\) training

4.5 Ablation analysis

To illustrate the impact of the hyper-parameter k and the discriminative component on \({ { \texttt {GANBLR} } }\), we conducted ablation studies with the following configurations of \({ { \texttt {GANBLR} } }\):

  • \({ { \texttt {GANBLR-nAL} } }\) As discussed in Sect. 3.4, \({ { \texttt {GANBLR-nAL} } }\) does not include the discriminator, and the generator has a slightly simplified objective function, i.e. it is trained solely by maximizing \(\log (\textrm{P}(Y_{g}|X_{g}^{k}))\) as shown in Algorithm 2.

  • k = 0 In this experiment, the Bayesian network generator in \({ { \texttt {GANBLR} } }\) has \(k = 0\), i.e. we use a naive Bayes model, which means that no feature interactions are modelled.

  • k = 1 In this experiment, the generator in \({ { \texttt {GANBLR} } }\) is a Bayesian network with \(k = 1\), i.e. order 1 feature interactions are modelled.

  • k = 2 In this experiment, the generator in \({ { \texttt {GANBLR} } }\) is a Bayesian network with \(k = 2\), i.e. order 2 feature interactions are modelled.

We compare the performance of \({ { \texttt {GANBLR} } }\) and \({ { \texttt {GANBLR-nAL} } }\) using the same strategy that we used to compare \({ { \texttt {GANBLR} } }\) with the other competing baselines (a minimal sketch of this TSTR evaluation is given below). It can be seen from Table 12 that \({ { \texttt {GANBLR} } }\) has better average performance than \({ { \texttt {GANBLR-nAL} } }\), especially on large datasets, for various values of k, demonstrating its superior machine learning utility. \({ { \texttt {GANBLR} } }\) outperforms \({ { \texttt {GANBLR-nAL} } }\) except for \(k=1\) on medium datasets and \(k=2\) on small datasets. This highlights the significance of the adversarial component in the \({ { \texttt {GANBLR} } }\) formulation. Nonetheless, the comparison also highlights the usefulness of \({ { \texttt {GANBLR-nAL} } }\) as an effective sampling method that does not employ game-theoretic adversarial learning.
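The sketch below outlines the TSTR (train-on-synthetic, test-on-real) evaluation underlying Table 12; the fit/sample interface of the synthesizer and the logistic-regression evaluator are hypothetical stand-ins, not the exact experimental code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def tstr_accuracy(X_real, y_real, synthesizer, test_size=0.2, seed=0):
    """Train-on-Synthetic, Test-on-Real (TSTR) machine learning utility."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_real, y_real, test_size=test_size, random_state=seed)
    synthesizer.fit(X_tr, y_tr)                   # hypothetical GANBLR-style API
    X_syn, y_syn = synthesizer.sample(len(X_tr))  # hypothetical sampling call
    clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    return accuracy_score(y_te, clf.predict(X_te))

# Ablation loop over k for the adversarial and non-adversarial variants,
# assuming both GANBLR and GANBLR-nAL expose the same hypothetical interface:
# for k in (0, 1, 2):
#     for model in (GANBLR(k=k), GANBLRnAL(k=k)):
#         print(k, type(model).__name__, tstr_accuracy(X_real, y_real, model))
```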

Table 12 also presents the impact of various values of k on \({ { \texttt {GANBLR} } }\). As expected, higher values of k lead to better results on large and medium datasets. However, on small datasets, \(k=1\) generally leads to superior performance. A likely reason is that \(k=2\) over-fits on small datasets, with the traditional bias-variance trade-off coming into effect. Interestingly, this study reveals how \({ { \texttt {GANBLR} } }\)'s hyper-parameter k can be matched to datasets of various sizes.

For the statistical similarity comparison, Table 13 shows that \({ { \texttt {GANBLR} } }\) achieves a much smaller JSD between the generated synthetic data and the real data than \({ { \texttt {GANBLR-nAL} } }\). In particular, on large datasets with \(k=2\), \({ { \texttt {GANBLR} } }\) can generate high-quality synthetic data with strong similarity to the real data. Similar results can be seen in terms of WD, where \({ { \texttt {GANBLR} } }\) generally performs better than \({ { \texttt {GANBLR-nAL} } }\) (Table 14). These findings indicate that the adversarial component in \({ { \texttt {GANBLR} } }\) helps significantly in generating higher-quality synthetic data, especially with higher values of k.

Table 12 Ablation analysis of \({ { \texttt {GANBLR} } }\) and \({ { \texttt {GANBLR-nAL} } }\) with varying interactions \(k = 0, 1, 2\)—Machine learning utility (in terms of TSTR)
Table 13 Ablation analysis of \({ { \texttt {GANBLR} } }\) and \({ { \texttt {GANBLR-nAL} } }\) with varying interactions \(k = 0, 1, 2\)—JSD similarity
Table 14 Ablation analysis of \({ { \texttt {GANBLR} } }\) and \({ { \texttt {GANBLR-nAL} } }\) with varying interactions \(k = 0, 1, 2\)—WD similarity

5 Conclusion

In this work, we presented a novel technique for generating tabular data using the \({{\texttt {GAN}}}\) strategy. Our proposed \({ { \texttt {GANBLR} } }\) framework relies on discriminatively trained Bayesian networks as both the generator and the discriminator, which learn by optimizing a game-theoretic objective function. We showed that \({ { \texttt {GANBLR} } }\) not only advances the existing SOTA \({{\texttt {GAN}}}\)-based models but also provides excellent interpretability during training. We evaluated the data generation performance of \({ { \texttt {GANBLR} } }\) by comparing it against several SOTA baselines and analysed its performance in terms of machine learning utility as well as statistical similarity. The results show that the synthetic datasets generated by \({ { \texttt {GANBLR} } }\) achieve the best machine learning utility and statistical similarity comparable to that of SOTA methods. These results demonstrate \({ { \texttt {GANBLR} } }\)'s potential for a wide range of applications in tabular data generation and augmentation across sectors such as banking, insurance, health and many other industries. We highlight some future works as follows:

  • We have constrained ourselves to Bayesian networks with \(k \le 2\) in this work. We are interested in seeing how the performance of \({ { \texttt {GANBLR} } }\) varies as the value of k is increased.

  • We have focused mainly on restricted \({ { \texttt {BN} } }\) models, i.e. \({ { \texttt {KDB} } }\) models, in our current \({ { \texttt {GANBLR} } }\) formulation; we are keen to study the model with unrestricted Bayesian network structures.

  • Enhancing \({ { \texttt {GANBLR} } }\) to generate numerical attributes is another direction we are exploring.

6 Code

The code of \({ { \texttt {GANBLR} } }\) can be downloaded from: https://github.com/tulip-lab/open-code/tree/develop/GANBLR.