Generative Adversarial Network Performance in Low-Dimensional Settings

A generative adversarial network (GAN) is an artificial neural network with a distinctive training architecture, designed to create examples that faithfully reproduce a target distribution. GANs have recently had particular success in applications involving high-dimensional distributions in areas such as image processing. Little work has been reported for low dimensions, where properties of GANs may be better identified and understood. We studied GAN performance in simulated low-dimensional settings, allowing us to transparently assess effects of target distribution complexity and training data sample size on GAN performance in a simple experiment. This experiment revealed two important forms of GAN error, tail underfilling and bridge bias, where the latter is analogous to the tunneling observed in high-dimensional GANs.

GAN error measurement, design, stability, and training convergence) that can be used in subsequent experiments with broader scope and more factors. Also, this study emphasizes the primacy of the relationship of GAN error to the size N of the training sample, and our experiment finds that GAN error is log-log linear with respect to N. Surprisingly, in the cases in our experiment, this relationship extends to even small N, allowing GAN error to be understood succinctly in terms of its error exponent (log-log slope). Finally, our experiment uncovers two forms of GAN error, which we term tail underfilling and bridge bias. This demonstrates that GAN mode tunneling (of which bridge bias is the low-dimensional analogue) occurs, and can be studied, in low-dimensional GANs, which validates our choice to study GAN performance in low-dimensional settings.
The remainder of this paper is organized as follows. Section 2 presents important details of our experiment, including the experiment protocol followed and our use of earth mover (EM) distance to quantify GAN error d(L, T). We also present the architecture of the GAN in our experiment and our GAN training procedure. Section 3 presents and discusses the results of our experiment. Section 4 closes with a summary and some related final remarks.

Preliminaries
This section details the way our experiment was conducted. This includes, in particular, the protocol used for the experiment's trials and the particular form of GAN studied. We also discuss our use of EM distance to quantify GAN error.

Protocol for Experiment Trials
The primary objective of our study was to determine how training sample size affects GAN performance for different degrees of target distribution complexity. Our experiment to explore this relationship among GAN error, sample size, and target complexity consisted of trials in which our GAN was trained and exercised over a range of training sample sizes N for each of six target distributions T with varying complexity, including three one-dimensional distributions and three two-dimensional distributions.
All trials in our experiment were conducted according to the protocol diagrammed in Fig. 1. Each trial began with a given target distribution T and a size N for the training sample S. The training sample S was drawn by simple random sampling from T and used to train the GAN. Once trained, the GAN generator synthesized a sample Q of size 10,000 from the GAN's learned distribution L, a second sample R of size 10,000 was drawn from T, and the distance d(Q, R) was calculated. This distance d(Q, R) is an approximation of the GAN error d(L, T). This whole process was repeated 100 times (a new GAN trained each time), and the GAN error d(L, T) for the trial was estimated by the average d̄(Q, R) of the 100 values of d(Q, R).
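The trial protocol can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the experiment: the target is taken to be N(0, 1), the hypothetical `fit_stand_in_generator` replaces GAN training with a jittered resampling of the training sample, and the evaluation sample size and number of repetitions are reduced from the protocol's 10,000 and 100 so that the sketch runs quickly. The distance function relies on the fact that, for two equal-size one-dimensional samples, the EM distance between their empirical distributions is the mean absolute difference of their order statistics.

```python
import random


def em_distance_1d(q, r):
    """EM distance between the empirical distributions of two equal-size 1D samples."""
    if len(q) != len(r):
        raise ValueError("samples must have equal size")
    return sum(abs(a - b) for a, b in zip(sorted(q), sorted(r))) / len(q)


def draw_target(n, rng):
    """Simple random sample of size n from the ASSUMED target T = N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def fit_stand_in_generator(train, rng, jitter=0.1):
    """Hypothetical stand-in for GAN training (NOT a GAN): returns a sampler
    that resamples the training data and adds small Gaussian jitter."""
    def generate(n):
        return [rng.choice(train) + rng.gauss(0.0, jitter) for _ in range(n)]
    return generate


def estimate_error(n_train, n_eval=2000, n_reps=20, seed=0):
    """Average d(Q, R) over repeated trials, as an estimate of d(L, T)."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_reps):
        s = draw_target(n_train, rng)               # training sample S from T
        generate = fit_stand_in_generator(s, rng)   # a new model for each repetition
        q = generate(n_eval)                        # sample Q from the learned L
        r = draw_target(n_eval, rng)                # fresh sample R from T
        distances.append(em_distance_1d(q, r))
    return sum(distances) / n_reps


if __name__ == "__main__":
    for n_train in (10, 100, 1000):
        print(n_train, round(estimate_error(n_train), 4))
```

Replacing the stand-in with a trained generator, and the reduced counts with those of the protocol, would recover the structure of a single trial; only the one-dimensional case is covered by this distance function.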

GAN Fundamentals
A GAN has two components in its training phase: a generator, G_θ, and a discriminator, D_φ, where θ and φ are vectors of parameters estimated in the learning process. The generator G_θ is a mapping from a noise (latent) distribution Z to its learned distribution L. The discriminator D_φ assigns a probability that a presented example, whether from T or synthesized, comes from T. Both G_θ and D_φ are typically implemented by artificial neural networks (ANNs). These two networks are trained according to a combined loss function ℒ(θ, φ) [21], given as Eq. (1). At each iteration in training, G_θ tries to minimize ℒ(θ, φ) while D_φ tries to maximize ℒ(θ, φ). Popular GAN variants are distinguished by different choices of f(x) in Eq. (1). For example, the original Goodfellow GAN [2] and the Wasserstein GAN [20] correspond, respectively, to f(x) = −ln(1 + e^{−x}) and f(x) = x. When the GAN discriminator is optimal, the GAN generator minimizes the distance between the target distribution T and the distribution L learned by the generator; the Goodfellow GAN minimizes the Jensen-Shannon divergence, and the Wasserstein GAN minimizes the EM distance between L and T.
In our experiment we used a Wasserstein GAN with a gradient penalty term added to the loss in Eq. (1), called a WGAN-GP [22]. Table 1 summarizes the architecture of our WGAN-GP; this architecture is recommended in Ref. [24]. The generator output dimension is either one or two depending on the dimension of the target distribution in our experiment. We used an Adam optimizer [27] for both the generator and discriminator with a learning rate of 1e-4, β_1 = 0.5, and β_2 = 0.9. For each update of the generator, we performed five updates of the discriminator. We found these values to give relatively consistent, stable results during training. Choosing training parameters remains an art based on experience and trial and error.
The GAN training process is notoriously unstable, and various expedients have been proposed in the literature to stabilize training [15]. These include changes to the loss function used in training [20,22], tricks for improving GAN training [15,23], and investigations of GAN convergence [24,26]. The WGAN-GP is known to exhibit training instabilities [24], but we encountered none in our low-dimensional setting. We found with the WGAN-GP that the GAN error d(L, T) consistently reached steady state after training for 50,000 epochs. Figure 2 shows the evolution of GAN error with number of learning epochs for a typical trial in our experiment.
Table 1. Architecture of GAN generator (left) and discriminator (right). The generator input is a vector of independent N(0, 1) noises with a length of 128. The leaky-relu function is defined f (
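For reference, the gradient penalty of the WGAN-GP formulation can be written in the notation of this section as below. This is a sketch rather than a restatement of Eq. (1): the penalty weight λ and the interpolation variable ε are not specified in the text above and are assumptions here (λ = 10 is the default in the original WGAN-GP proposal), and the sign with which the term enters the loss depends on the sign convention of Eq. (1).

```latex
\mathrm{GP}(\varphi)
  = \lambda \, \mathbb{E}_{\hat{x}}
    \left[ \left( \left\| \nabla_{\hat{x}} D_{\varphi}(\hat{x}) \right\|_{2} - 1 \right)^{2} \right],
\qquad
\hat{x} = \varepsilon x + (1 - \varepsilon) \, G_{\theta}(z),
\quad x \sim T, \;\; z \sim Z, \;\; \varepsilon \sim \mathrm{Unif}(0, 1).
```

The penalty pushes the norm of the discriminator's gradient toward one along line segments between target and generated examples, a soft form of the Lipschitz constraint on which the Wasserstein formulation relies.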

EM Distance
Our study of GAN performance needed a measure of distance between distributions to quantify GAN error. EM distance is attractive for this purpose, and hereafter, d(U, V) denotes the EM distance between distributions U and V. EM distance is a special case (p = 1) of p-Wasserstein distance, which measures separation among probability distributions and is a metric in the general setting of Radon spaces [25]. Let U and V be two distributions on d-dimensional Euclidean space ℜ^d, defined by cumulative distribution functions (CDFs) F and G, respectively. The EM distance separating the distributions U and V is
$$ d(U, V) = \inf_{H} \int_{\Re^{d} \times \Re^{d}} \| u - v \| \, dH(u, v), \tag{2} $$
where H is any joint CDF on ℜ^d × ℜ^d such that, marginally, U ∼ F and V ∼ G. EM distance can be understood intuitively as the infimum cost required to move or rearrange a probability mass distributed according to U ∼ F into the distribution of V ∼ G, with Euclidean distance ‖·‖ measuring the move required for each infinitesimal of probability mass. The joint CDF H in the infimum in Eq. (2) represents different possible plans for transporting each infinitesimal of probability mass from U to V.
EM distance has a long history, stretching back to Monge's 1781 work in transportation theory [28]. Recently, EM distance has been used broadly in computer science, with applications to pattern recognition [29], image databasing [30], and content-based image retrieval [31]. Arjovsky et al. used EM distance to formulate the Wasserstein GAN [20] to address issues of mode collapse and vanishing gradients, two problems inherent in the original GAN framework [2] and of continuing concern. EM distance is just one of many distance measures that can be defined for probability distributions [32], and no consensus has yet emerged for measuring GAN error [33]. Among measures of distance between distributions, EM distance has a powerful and prevailing role in many fields because of its sensitivity both to amount of mass and to underlying metrical, or ground, distance.
This feature of EM distance, its dual sensitivity to mass and distance, makes it attractive for our purposes as a direct measure of GAN error.
EM distance can be expressed analytically in some limited cases, and in one dimension the sample S closest in EM distance to T takes a simple form. In general, though, EM distance is found numerically using Sinkhorn's algorithm to solve a regularized version of the basic optimal transport problem [34,35]. We used the Python Optimal Transport Library [36] implementation of Sinkhorn's algorithm in this work to calculate GAN error d(L, T).
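The difference between the exact EM distance and its entropy-regularized approximation can be illustrated with a small pure-Python sketch. This is not the Python Optimal Transport Library implementation; the function names `em_distance_1d` and `sinkhorn_cost` and the parameter names `reg` and `n_iter` are hypothetical, both samples are assumed to carry uniform weights, and the plain (non-log-domain) iteration used here is only suitable for small problems with moderate regularization.

```python
import math


def em_distance_1d(q, r):
    """Exact EM distance between two equal-size 1D empirical distributions."""
    if len(q) != len(r):
        raise ValueError("samples must have equal size")
    return sum(abs(a - b) for a, b in zip(sorted(q), sorted(r))) / len(q)


def sinkhorn_cost(xs, ys, reg, n_iter=500):
    """Transport cost of the entropy-regularized plan between two uniformly
    weighted point sets xs and ys (lists of coordinate tuples)."""
    n, m = len(xs), len(ys)
    # Euclidean ground distance between every pair of points.
    cost = [[math.dist(x, y) for y in ys] for x in xs]
    # Gibbs kernel; smaller reg is closer to exact transport but converges slower.
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Cost of the resulting plan P[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


if __name__ == "__main__":
    xs = [(0.0,), (1.0,)]
    ys = [(0.5,), (1.5,)]
    print("exact:", em_distance_1d([0.0, 1.0], [0.5, 1.5]))
    for reg in (0.5, 0.1):
        print("reg", reg, "->", sinkhorn_cost(xs, ys, reg, n_iter=2000))
```

For this two-point example the exact EM distance is 0.5, and the regularized cost lies above it, moving closer as the regularization strength decreases. The same trade-off between regularization bias and convergence speed applies to any numerical EM distance obtained this way.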

Experiment Results
This section presents and discusses the results of our target distribution complexity experiment in which the relationship of GAN error to training sample size was studied as it varied with the complexity of the target distribution. The experiment included six target distributions, three one-dimensional distributions and three two-dimensional distributions. The three one-dimensional target distributions had one, two, and three modes. As remarked earlier, adding more modes can be thought of as adding more digits in the MNIST data set. The three two-dimensional distributions in this experiment were a Gaussian distribution, an equal mixture of two Gaussian distributions, and a Swiss roll distribution. The Swiss roll distribution, commonly used in machine learning studies [37], is given by a random vector constructed from W_1, W_2, and U, which are independent random variates with U ∼ Unif(0, 9π/4) and W_1, W_2 ∼ N(0, 0.01). The density of the Swiss roll distribution is shown in Fig. 3.

Results
The results from the target distribution complexity experiment are plotted in Fig. 4. First, these results confirm that increased target complexity results in poorer GAN performance and greater error. We see this in Fig. 4 for the one-dimensional T in the ordering of higher GAN error with greater number of modes. Figure 4 shows this also for the two-dimensional T in the experiment, provided the Swiss roll T is interpreted as more complex than a distribution with a single point mode but less complex than a T with two point modes. Second, in the log-log format of Fig. 4, the relationship of GAN error d(L, T) to sample size N appears to be linear or, equivalently,
$$ d(L, T) = 10^{a} N^{-b}, \tag{3} $$
with error exponent b. This linear relationship might be anticipated to hold asymptotically for large N; in the results in Fig. 4, it appears to apply even at small sample sizes. A standard analysis of covariance (ANCOVA) model confirms this. That analysis indicates further that the error exponent b in Eq. (3) differs significantly according to the dimensionality of T, but that no statistically discernible differences exist (at α = 0.05, general linear F test [38]) among the error exponents for the one-dimensional T or among those for the two-dimensional T distributions. The common error exponents are estimated to be 0.47 and 0.19, respectively, for the one- and two-dimensional T distributions in the experiment.
The GAN errors d(L, T) shown in Fig. 4 were obtained according to the protocol in Fig. 1. The protocol's use of size-10,000 samples Q and R, together with its numerical approximation of d(Q, R), imposes a noise floor on the GAN error that can be resolved by this approach. This floor can be determined by using the protocol to find d(L, T) in the extreme case where the GAN learns the target distribution perfectly so that L = T with zero GAN error. These protocol-estimated d(T, T) floors are reported in Table 2 for each target distribution in our experiment. GAN errors below these floors cannot be accurately determined by our protocol with size-10,000 Q and R samples; larger sample sizes would lower these floors and increase the protocol's resolution. The floors in Table 2 show that Q and R samples of size 10,000 are sufficient for our experiment.
The measurements of GAN error d(L, T) in Fig. 4 made according to the protocol in Fig. 1 were produced on the Raritan computing cluster at the National Institute of Standards and Technology. Twelve of the cluster's Nvidia Volta graphics processing units (GPUs) were used in parallel for the experiment.⁵ Training a GAN to 50,000 epochs (Fig. 2) took about 45 min. By training 12 GANs in parallel, each GAN error d(L, T) in the experiment was calculated in 5 to 8 h. The six error curves in Fig. 4, each estimated from 16 error measurements, therefore required a total of about 600 h (25 d) to complete.
⁵ Certain commercial equipment, instruments, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.
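The error exponent of the log-log linear relationship between GAN error and training sample size is the negative slope of a straight line fitted to log10 of the error against log10 of N. The sketch below estimates it for a single error curve by ordinary least squares. It is an illustration only: the function name `fit_error_exponent` is hypothetical, the data are synthetic rather than taken from the experiment, and the fit is a per-curve regression, not the ANCOVA model with common slopes described above.

```python
import math


def fit_error_exponent(sizes, errors):
    """Least-squares fit of log10(error) = a - b * log10(N); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(e) for e in errors]
    count = len(xs)
    x_mean = sum(xs) / count
    y_mean = sum(ys) / count
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, -slope  # the error exponent b is the negative slope


if __name__ == "__main__":
    # Synthetic, noise-free errors generated with a = 0.1 and b = 0.47
    # (illustrative values; NOT measurements from the experiment).
    sizes = [10, 32, 100, 316, 1000]
    errors = [10 ** 0.1 * n ** -0.47 for n in sizes]
    a, b = fit_error_exponent(sizes, errors)
    print("a =", round(a, 3), "b =", round(b, 3))
```

With measured errors in place of the synthetic ones, points at or below the protocol's noise floor would need to be excluded before fitting, since they no longer reflect GAN error.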

Discussion
The low-dimensional setting of our experiment readily revealed two major sources for the GAN errors presented in Fig. 4: underfilled tails and bridge bias. Figure 5 presents three trials with a target standard normal N(0, 1) distribution, using training sample sizes N ranging from 10 to 1000. Figure 5 shows that at small training sample sizes N, the GAN's learned distribution, represented by the examples in red, underfills the tails of the target distribution (in black), and this error diminishes as N increases. Figure 5 also suggests that this is not inherently a problem with the GAN. The GAN can do no better than the training data available to it, and the poor fidelity of the training sample (in blue) at small N is at least partially the origin of the underfilled tails.
Figure 6 shows results of a GAN trained on a bimodal target distribution (in black) made of an equal mixture of two normal distributions N(−5, 1) and N(5, 1). The target distribution T has a low-density bridge connecting its two modes. Figure 6 shows that for small N the GAN's learned distribution (examples in red) overestimates the bridge density, even though the training sample (in blue) actually underrepresents the target bridge. The side experiment described next suggests that this bridge bias is a GAN structural feature that is only slowly corrected by increasing the size N of the training sample.
A side experiment separate from our main target complexity experiment was conducted to explore the bridge bias seen in Fig. 6. This side experiment looked at only the univariate bimodal target distribution T of Fig. 6, estimating the size of the bridge in the GAN-learned distribution L over a set of training sample sizes N. The results of this side experiment are presented in Fig. 7, where the proportion of L in T's bridge is plotted against N. These results show diminishing bridge bias as the training sample size increases. However, extrapolation of the regression line (solid line) in Fig. 7 suggests that very large sample sizes would be needed, even in this simple one-dimensional learning problem, to approach the target bridge proportion (dashed line) and effectively eliminate the bridge bias in L.
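The bridge proportion used in the side experiment can be illustrated for the bimodal target. The sketch below assumes a bridge interval of [−2, 2], which is a hypothetical choice and not the definition used in the experiment; it also reads the second parameter of N(−5, 1) and N(5, 1) as a unit standard deviation. It computes the exact proportion of the target in the assumed bridge and the empirical proportion for any sample, such as one synthesized by a trained generator.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def target_bridge_proportion(lo, hi):
    """Mass that the equal mixture of N(-5, 1) and N(5, 1) places on [lo, hi]."""
    return sum(0.5 * (normal_cdf(hi, mu) - normal_cdf(lo, mu))
               for mu in (-5.0, 5.0))


def sample_bridge_proportion(sample, lo, hi):
    """Fraction of a sample that falls in [lo, hi]."""
    return sum(lo <= x <= hi for x in sample) / len(sample)


if __name__ == "__main__":
    lo, hi = -2.0, 2.0  # ASSUMED bridge interval, for illustration only
    rng = random.Random(0)
    # A sample drawn directly from the target stands in for generator output here.
    sample = [rng.gauss(rng.choice((-5.0, 5.0)), 1.0) for _ in range(10_000)]
    print("target proportion:", target_bridge_proportion(lo, hi))
    print("sample proportion:", sample_bridge_proportion(sample, lo, hi))
```

Applying `sample_bridge_proportion` to generator output for a range of training sample sizes N, and comparing against the target proportion, gives the kind of comparison the side experiment describes.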
The trimodal target distribution in Fig. 8 shows that for a given sample size N, bridge bias worsens in the presence of multiple target bridges. Figures 6 and 8
The GAN bridge bias seen with our one-dimensional multimodal target distributions appears also with our two-dimensional distribution T. Figure 9 shows GAN-generated samples for the target bimodal mixture of bivariate Gaussian distributions in our experiment. In each of the three cases in Fig. 9 (N = 10, 100, and 1000), the learned bridge between the two modes greatly overestimates the (very low) density of the bridge in the target distribution. This bias is greater than that for our one-dimensional target distributions. Also, while the bridge bias for two-dimensional T decreases with N, this decrease is slower than that in the corresponding one-dimensional T. Interestingly, this bridge bias in two dimensions is restricted mostly to the one-dimensional line between the two modes, meaning that while the bridge bias for two-dimensional T is more pronounced compared to that for one-dimensional T, it is also more restricted relative to the support of T. Experiments to explore the relationships among amount and extent of GAN bridge bias, GAN architecture, training parameters, training sample size, and target complexity (dimensionality and multimodality) are needed.
The univariate and bivariate bimodal target distributions in Figs. 6 and 9 have point modes, and the GAN bridge between the modes is essentially one-dimensional. The Swiss roll distribution in our complexity experiment affords us an opportunity to see what can happen when a target mode extends beyond just a point. Figure 10 presents scatterplots of GAN-generated samples for the Swiss roll target distribution in our experiment. The training sample of size N = 10 is not enough for the GAN to learn the submanifold on which the target ridge resides, and it reverts to identifying modes with heavily biased bridges. With a relatively small training sample size N = 100, though, the GAN has discovered this structure and almost all bridge bias is gone.
Bridge bias could be understood to arise from the fact that the generator is a continuous map from the latent space of the GAN input noise distribution to the support (data space) of the target distribution, while a bimodal target distribution with no bridge (zero mass between modes) would require a discontinuous map between the latent space and the data space. In other words, GAN bridge bias may be an artifact of a continuous approximation to a discontinuous function. Our experiment found that GAN performance decreases as modes are added, and we identified bridge bias as a significant cause. In fact, a GAN bridge can exist between modes even in the absence of any corresponding target bridge. These GAN bridges may be the source of low-fidelity realizations synthesized by GANs trained for image generation [39].
A conditional GAN can alleviate bridge bias to some degree. A GAN is trained in an unsupervised fashion, with no labels or other identifying information attached to the examples in the training sample S. A conditional GAN is trained with labeled data, and if these class labels line up well with target modes, bridge bias can be reduced. To see this, consider the trimodal target distribution T shown (top) in Fig. 11. Training examples drawn from T are identified (red or blue) depending on the mode from which they derive. The color labels unambiguously identify the left mode, but they do not resolve the center and right target modes. This is akin to having labels for images of cats and dogs, but no labels specifying dog breed. A GAN trained with labeled data (a conditional GAN [40]) learned the distribution shown (bottom) in Fig. 11. The left and center modes from different classes have no appreciable bridge, while the center and right modes from the same class have a bridge.
Of course, labeled data are not always available, and when labels are available they may not correspond to different modes. Still, further study is warranted to discover when and to what extent a conditional GAN with labeled training data can alleviate bridge bias.
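The continuity argument for bridge bias given earlier in this section can be made concrete with a toy generator. The sketch assumes a one-dimensional latent variable Z ∼ N(0, 1), a hypothetical continuous generator g(z) = 5 tanh(kz) whose outputs concentrate near −5 and 5 as the steepness k grows, and a bridge region (−2, 2); none of these choices comes from the experiment. Because g is continuous and increasing, the mass it places in the bridge is available in closed form.

```python
import math


def bridge_mass_continuous(k, half_width=2.0, mode=5.0):
    """P(|g(Z)| < half_width) for g(z) = mode * tanh(k * z) with Z ~ N(0, 1).

    |g(z)| < half_width exactly when |z| < atanh(half_width / mode) / k,
    and P(|Z| < c) = erf(c / sqrt(2)) for a standard normal Z.
    """
    z_edge = math.atanh(half_width / mode) / k
    return math.erf(z_edge / math.sqrt(2.0))


if __name__ == "__main__":
    for k in (1.0, 10.0, 100.0):
        print("steepness", k, "-> bridge mass", bridge_mass_continuous(k))
```

The bridge mass shrinks as k grows but remains positive for every finite k; only the discontinuous limit g(z) = 5 sign(z) places no mass between the modes. A generator restricted to continuous maps can therefore reduce a bridge but not remove it, which is consistent with the slow decay of bridge bias reported above.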

Summary and Related Remarks
GAN performance studies in the setting of high-dimensional applications have made only limited progress on pressing problems associated with GAN training, including mode collapse and training instability. In this study, to make progress on these and other questions relating to GAN performance, we took a fresh approach and considered GAN performance in low-dimensional settings. This approach offered important advantages: a reduced computational burden in experiments, more comprehensible and malleable target distributions, and easier assessment of GAN error. Our low-dimensional approach also carried risk. A low-dimensional GAN may not reproduce the high-dimensional phenomena that need to be understood. Encouragingly in this regard, our experiment revealed bridge bias in trained GANs, analogous to that seen in high dimensions.
A major purpose of our study was to establish protocols for GAN design and experimentation that fully exploit the advantages of low dimension and that can be used in subsequent, more elaborate experiments with low-dimensional GANs. Beyond this, our work makes two contributions. First, our work highlights the perspective of GAN error as a function of training sample size, because our experiment shows that this relationship is log-log linear and that the GAN error exponent (log-log slope) depends solely on the dimension of the target distribution. Second, our experiment uncovers two prominent forms of GAN error, tail underfilling and bridge bias, finding that both decrease only slowly with increasing sample size.
This initial study of low-dimensional GAN performance offers a framework for further investigation in many directions. Target distribution complexity can be varied by dimension and by number, distribution, and dimensionality of modes. Our experiment found, for example, that doubling the dimension of the target support from 1 to 2 roughly halved (from 0.47 to 0.19) the GAN error exponent. Determining whether and how this effect scales to higher dimensions has important implications for defining how well GANs can reasonably be expected to perform in many application settings. Also, investigation into the relationship between bridge bias and bridge length (spacing between modes) is needed; all else remaining equal, modes tend to be more separated in higher-dimensional data spaces.
All the GAN training samples in our experiment were drawn from the target distribution by simple random sampling. Other sampling schemes can be envisioned; one such scheme is stratified random sampling, in which the training sample is assembled from simple random samples drawn from each target mode. This higher-fidelity sampling scheme could be expected to yield reduced GAN error for any given sample size N; one would like to determine whether such a sampling scheme would, more powerfully, increase the GAN error exponent. The effect of training sample fidelity could be tested to its limit by studying training samples S that minimize the EM distance d(S, T) separating S and the target distribution T.
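The sampling schemes mentioned here can be sketched for one-dimensional targets. The code below is an illustration under stated assumptions: the bimodal target is the equal mixture of N(−5, 1) and N(5, 1) with unit standard deviations, stratification uses equal allocation across modes, and all function names are hypothetical. The last function uses a standard one-dimensional result, namely that the N equally weighted points closest in EM distance to a continuous distribution with CDF F are the mid-quantiles F⁻¹((2i − 1)/(2N)), i = 1, ..., N; whether this is the exact form the study has in mind is an assumption.

```python
import random
from statistics import NormalDist

MODES = (-5.0, 5.0)  # ASSUMED mode locations of the bimodal target


def simple_random_sample(n, rng):
    """n draws from the mixture; the number of draws per mode is random."""
    return [rng.gauss(rng.choice(MODES), 1.0) for _ in range(n)]


def stratified_sample(n, rng):
    """Equal allocation: n // len(MODES) draws from each mode
    (n is assumed to be divisible by the number of modes)."""
    per_mode = n // len(MODES)
    return [rng.gauss(mu, 1.0) for mu in MODES for _ in range(per_mode)]


def quantile_sample(n, dist=NormalDist(0.0, 1.0)):
    """Mid-quantile points F^{-1}((2i - 1) / (2n)), i = 1..n, for a 1D target."""
    return [dist.inv_cdf((2 * i - 1) / (2 * n)) for i in range(1, n + 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    srs = simple_random_sample(10, rng)
    strat = stratified_sample(10, rng)
    print("left-mode count, simple random:", sum(x < 0 for x in srs))
    print("left-mode count, stratified:   ", sum(x < 0 for x in strat))
    print("EM-minimizing sample for N(0, 1):", quantile_sample(5))
```

Stratification removes the random imbalance between modes that simple random sampling exhibits at small N, while the mid-quantile sample removes sampling variability altogether and so represents the limiting case of training sample fidelity in one dimension.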