Variations in Variational Autoencoders - A Comparative Evaluation

Variational Auto-Encoders (VAEs) are deep latent space generative models which have been immensely successful in many applications such as image generation, image captioning, protein design, mutation prediction, and language models among others. The fundamental idea in VAEs is to learn the distribution of data in such a way that new meaningful data can be generated from the encoded distribution. This concept has led to tremendous research and variations in the design of VAEs in the last few years creating a field of its own, referred to as unsupervised representation learning. This paper provides a much-needed comprehensive evaluation of the variations of the VAEs based on their end goals and resulting architectures. It further provides intuition as well as mathematical formulation and quantitative results of each popular variation, presents a concise comparison of these variations, and concludes with challenges and future opportunities for research in VAEs.


I. INTRODUCTION
Data generation, motivated by the scarcity of training data, is a fundamental problem in many areas of artificial intelligence such as computer vision, pattern recognition, and natural language processing [1]. In recent years, deep generative models have gained a lot of attention due to their numerous applications in deep learning. Among them, VAEs [2] and Generative Adversarial Networks (GANs) [3] are regarded as the two most popular approaches to generative modeling.
The VAE can be regarded as a combination of an encoder and a decoder Bayesian network. The encoder maps an input (e.g., an image) x to a latent vector z, and the decoder then maps the latent vector z back to image or data space [4]. VAEs (footnote 1) enhance a normal Autoencoder (AE) by adding a Bayesian component that learns the parameters representing the probability distribution of the data. This is achieved by imposing a prior on the latent variable, typically modeled as a unit Gaussian random variable. This implicitly results in a regularization that can be used to explain the probability of the input. Thus, the VAE is a generative model that can sample from the latent distribution produced by the encoder and generate new input data via the decoder. VAEs do not suffer from the main problems encountered in GANs, namely non-convergence leading to mode collapse and difficulty of evaluation [3], [5], [6]. Moreover, VAEs have sound theoretical foundations: first, by introducing the variational lower bound, the complicated calculation of the marginal likelihood is avoided. Second, through the reparameterization trick, the complicated Markov chain sampling of the latent variable is avoided. A key benefit of VAEs is the ability to control the distribution of the latent representation vector z, which allows VAEs to be combined with representation learning to further improve downstream tasks. VAEs are able to learn smooth latent representations of the input data [7] and thus can generate new meaningful samples in an unsupervised manner. These properties have allowed VAEs to enjoy success especially in computer vision, e.g., static image generation [8], zero-shot learning [9]-[11], image super-resolution [12], [13], and semantic image inpainting [14], [15].
Footnote 1: https://github.com/VAEs-Tutorial/paper
The associate editor coordinating the review of this manuscript and approving it for publication was Feng Shao.
Despite the above-mentioned advantages of VAEs, they do have some limitations: 1) the generated images tend to be blurry, 2) the latent representation does not have an interpretable meaning, 3) the commonly used Gaussian prior has limitations because the learnt representations are unimodal and do not allow for different or mixed data distributions, and 4) the Gaussian definition is based on the L2-norm, which suffers from the curse of dimensionality. To address these problems, researchers have proposed many variations of the VAE based on different task requirements, such as feature learning and deep clustering, with the goal of greatly improving the quality of the generated data. Current VAE research focuses primarily on three directions: 1) improving the disentanglement of VAEs, 2) applying custom VAEs to real-world applications, and 3) improving the quality of generated images. Many VAE variants have been proposed in the following categories: 1) architecture-variant, such as VAE-GAN and CVAE, 2) regularizing posterior-variant, which regularizes the posterior to improve disentanglement capability, and 3) prior-variant, which selects the prior based on the data distribution to improve the Bayesian VAE model. In the following sections, we provide details on the VAE variants implemented under the above categorization.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we focus on the recent advances in VAEs as these provide an elegant statistical approach to meaningful data generation resulting in an entire field of its own referred to as unsupervised representation learning. We study the existing VAE-variants and provide a comprehensive analysis and comparisons between different approaches. The rest of the paper is organized as follows:
-We present an overview of the conventional VAE.
-Variations of the VAE are described mathematically along with their differences, pros and cons.
-We conduct experiments on MNIST dataset and perform comparative analysis.
-We conclude this review with some future directions for advancement in this area.
The structure of our paper is as follows: Section II describes background on VAEs. Section III explains variants of VAEs in detail. Section IV provides a comparative analysis of experimental results on the MNIST dataset. Section V summarizes the variations of the VAE along with their differences, pros, and cons. Conclusions and future work are given in Section VI, and references are listed at the end.

II. PRELIMINARIES
The following sub-sections introduce the theory behind Autoencoders, deep generative models, and conventional VAE. Additionally, we discuss the variational bound and the reparameterization trick.

1) AUTOENCODERS
An Autoencoder (AE) is an unsupervised learning system where during training the expected output is an approximation of the input. AE is primarily applied to data dimensionality reduction, image classification, object detection, and image denoising [8], [16]- [18]. An AE consists of the following parts [19]:

2) ENCODER
A neural network that produces a compressed latent space representation of input data.

3) LATENT SPACE
Captures input to a knowledge representation, that is, to reduce the dimensionality of input such that maximum information is preserved in it.

4) DECODER
A reconstruction of the input data from the compressed latent space.
As shown in Figure 1, the encoder h encodes the original input X into a latent space Z. The decoder f decodes the latent space Z to recreate an approximation of the original data, $\hat{X} = f(Z) = f(h(X))$. Through training, the AE learns to reproduce the input at its output. An AE has two main applications: the first is data denoising, and the second is dimensionality reduction by removing redundant or unimportant features. In other words, the output is made approximately equal to the input under some constraints on the AE. These constraints force the encoder to decide which parts of the input need to be preserved and which parts can be discarded. Therefore, the Autoencoder can often learn the meaningful features of the data and discard the irrelevant ones. It is well known that an AE accomplishes dimensionality reduction similar to a nonlinear PCA. When an AE is used for classification, the decoder is removed after the AE is trained and replaced by a classifier network. An AE is not capable of generating new data because the latent space it produces is not regularized to aid in new data synthesis.
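The encode-then-decode composition described above can be illustrated with a deliberately tiny example. The sketch below is not the architecture of Figure 1: it assumes a hypothetical linear encoder and decoder that share a fixed unit-length weight vector `W` (no training), which is enough to show that inputs aligned with what the code preserves are reconstructed well, while everything else is discarded.

```python
import math

# Hypothetical fixed weights (an assumption of this sketch): a unit direction in 2-D.
W = (0.6, 0.8)

def encode(x):
    """Encoder h: project a 2-D input onto W, giving a 1-D latent code z."""
    return W[0] * x[0] + W[1] * x[1]

def decode(z):
    """Decoder f: map the 1-D code back to the 2-D input space."""
    return (z * W[0], z * W[1])

def reconstruct(x):
    """Compute the reconstruction f(h(x))."""
    return decode(encode(x))

def reconstruction_error(x):
    """Euclidean distance between the input and its reconstruction."""
    x_hat = reconstruct(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))

on_line = (3.0, 4.0)    # lies along W, so the 1-D code preserves it
off_line = (4.0, -3.0)  # orthogonal to W, so the code discards all of it

print(reconstruction_error(on_line))   # close to 0
print(reconstruction_error(off_line))  # close to the norm of off_line
```

A trained nonlinear AE replaces the fixed projection with learned networks, but the bottleneck plays the same role: it decides which directions of variation survive.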

A. DEEP GENERATIVE MODELS
The most common model in machine learning is the discriminative model. The discriminative model [20] infers certain features of the data from the original dataset and then uses these features to construct the corresponding application model, e.g., a classifier. On the other hand, a generative model aims to learn the features of the input and recover the original data or generate similar data from a latent space distribution.
Deep generative models use distribution estimation and sampling to generate new data [21]. To explain this further, suppose that in a continuous or discrete high-dimensional space there is data x obeying some unknown distribution P_data(x), and that this distribution must be estimated by a model distribution P_model(x) from the observed samples in the set X. The deep generative model learns an approximation of the unknown distribution P_data(x) from training data and allows new data to be generated from the estimated distribution P_model(x).
Traditional popular deep generative models belong to the Boltzmann family, i.e., Deep Belief Networks (DBNs) [22] and Deep Boltzmann Machines (DBMs) [23]. However, one major limitation of these models is their high computational cost during inference [24]. The most recent deep generative networks are VAEs and GANs. In this paper, we focus on VAEs and their variants.

B. VARIATIONAL AUTOENCODER (VAE)
A Variational Autoencoder (VAE) is a special autoencoder based on variational Bayes inference, originally proposed by Kingma and Welling [2]; see also Doersch [4]. The goal of a VAE is to learn the distribution of the training data so that, by sampling from it, we can generate new data. Since the training data may not necessarily have a well-defined mathematical distribution, we force the distribution of the output of the encoder (known as the latent space) to follow a known distribution, e.g., a normal distribution. Figure 2 shows the architecture of a VAE that has an encoder and a variational inference network, followed by the decoder that samples from the latent space to generate the output. The main difference between an AE and a VAE is that the AE learns a compressed representation of the input and its decompression to match the given input. In contrast, the VAE is a Bayesian model which learns the compressed representation of the AE and constructs the parameters representing the probability distribution of the data. It can sample from this distribution and generate new data samples. Therefore, the VAE is a generative model, whereas an AE, which only performs reconstruction, does not have an obvious generative interpretation.
If the original dataset is $X = \{x_i\}_{i=1}^{N}$, then each data sample $x_i$ is an independently generated continuous or discrete random variable, and the regenerated dataset at the output is $\hat{X} = \{\hat{x}_i\}_{i=1}^{N}$. Suppose the encoding process produces a latent variable z. Then the observable variable X is a random vector in a high-dimensional space, and the unobservable variable Z is a random vector in a relatively low-dimensional space.
In the implementation of the VAE, the encoder is a neural network whose input is a datapoint x and whose output is a latent representation z. We denote its weights and biases by φ. The decoder is another neural network whose input is the latent representation z and whose output is the parameters of the probability distribution of the data. The decoder's weights and biases are denoted by θ. Suppose we want to approximate a distribution p(Z|X) with some distribution q(Z|X) via the Kullback-Leibler (KL) divergence. Then, by the definition of KL,
$$D_{KL}[q(Z|X) \,\|\, p(Z|X)] = \sum_{Z} q(Z|X) \log \frac{q(Z|X)}{p(Z|X)} = E[\log q(Z|X) - \log p(Z|X)]$$
where the expectation is taken over $Z \sim q(Z|X)$. If we minimize the KL divergence and expand p(Z|X) using Bayes rule:
$$D_{KL}[q(Z|X) \,\|\, p(Z|X)] = E[\log q(Z|X) - \log p(X|Z) - \log p(Z) + \log p(X)]$$
Since the expectation is with respect to Z, log p(X) can be moved outside of it:
$$D_{KL}[q(Z|X) \,\|\, p(Z|X)] - \log p(X) = E[\log q(Z|X) - \log p(X|Z) - \log p(Z)]$$
Rearranging and recognizing $E[\log q(Z|X) - \log p(Z)]$ as a second KL divergence gives
$$\log p(X) - D_{KL}[q(Z|X) \,\|\, p(Z|X)] = E[\log p(X|Z)] - D_{KL}[q(Z|X) \,\|\, p(Z)]$$
Since $D_{KL}$ is always non-negative, we can conclude that:
$$\log p(X) \geq E[\log p(X|Z)] - D_{KL}[q(Z|X) \,\|\, p(Z)] \tag{9}$$
Equation 9 is an important result and is known as the Evidence Lower Bound (ELBO). In a deep neural network implementation of a VAE, equation 9 is used as the loss function during training of the network. The $E[\log p(X|Z)]$ term denotes the reconstruction, i.e., the generation of the output from the latent representation z. The term $D_{KL}[q(Z|X) \,\|\, p(Z)]$ measures how far the distribution of the latent space is from the target distribution p(z). Thus, the two components of equation 9 try to make the output similar to the input while keeping the distribution of the latent space as close to the target distribution p(z) as possible.
The ELBO is tight if q(z) = p(z|x), which is why q(z) is optimized to approximate the true posterior. For scalability to larger datasets, we do not optimize a separate q(z) for every data point x. Instead, an inference network q(z|x) is introduced, parameterized by a neural network that outputs a probability distribution for each data point x. Therefore, the final objective is to maximize:
$$L(\theta, \phi; x) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \,\|\, p(z)] \tag{10}$$
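The tightness statement can be checked numerically on a model where every quantity is available in closed form. The following is a minimal sketch under assumptions that are not part of the paper: a one-dimensional toy model with prior z ~ N(0, 1) and likelihood x|z ~ N(z, 1), for which the marginal is x ~ N(0, 2) and the true posterior is N(x/2, 1/2). The function names are illustrative.

```python
import math

def log_marginal(x):
    """log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def kl_to_standard_normal(m, s2):
    """Closed-form D_KL[N(m, s2) || N(0, 1)]."""
    return 0.5 * (s2 + m * m - 1.0 - math.log(s2))

def expected_log_likelihood(x, m, s2):
    """E over z ~ N(m, s2) of log N(x | z, 1), also in closed form."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)

def elbo(x, m, s2):
    """ELBO = reconstruction term minus KL term, for q(z|x) = N(m, s2)."""
    return expected_log_likelihood(x, m, s2) - kl_to_standard_normal(m, s2)

x = 1.5
tight = elbo(x, m=x / 2.0, s2=0.5)  # q equals the true posterior N(x/2, 1/2)
loose = elbo(x, m=0.0, s2=1.0)      # q equals the prior, ignoring x

print(log_marginal(x), tight, loose)
```

With q set to the true posterior the bound matches log p(x) up to rounding, and any other choice of q gives a strictly smaller value, the gap being the KL divergence between q and the true posterior.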

1) REPARAMETERIZATION TRICK
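According to the objective described in equation (10), after we introduce $q_\phi(z|x)$ to approximate $p_\theta(z|x)$, we need to sample z from $q_\phi(z|x)$. An easy choice is to assume that $q_\phi(z|x)$ is Gaussian, so that z can be sampled in the following reparameterized way [25]:
$$z = \mu(x) + \sigma(x) \odot \epsilon$$
where $\epsilon$ is an auxiliary noise variable such that $\epsilon \sim N(0, I)$, i.e., q(z|x) is a Gaussian with parameters $\mu(x)$ and $\Sigma(x) = \mathrm{diag}(\sigma^2(x))$. Then the KL divergence between q(z|x) and p(z) = N(0, I) can be computed in closed form as follows:
$$D_{KL}[N(\mu(x), \Sigma(x)) \,\|\, N(0, I)] = \frac{1}{2}\left(\mathrm{tr}(\Sigma(x)) + \mu(x)^{T}\mu(x) - k - \log\det\Sigma(x)\right)$$
where k is the dimension of z. Replacing $\Sigma(x)$ with $e^{\Sigma(x)}$, so that the network outputs the log-variance and the variance stays positive:
$$D_{KL}[N(\mu(x), \Sigma(x)) \,\|\, N(0, I)] = \frac{1}{2}\sum_{j=1}^{k}\left(e^{\Sigma_j(x)} + \mu_j^2(x) - 1 - \Sigma_j(x)\right)$$
where $\Sigma_j(x)$ now denotes the j-th log-variance output of the encoder. The reparameterization turns the relationship between the latent variable z, σ, and µ from a sampling operation into a deterministic numerical calculation, so that it can be optimized directly using stochastic gradient descent [26]. The main purpose of the reparameterization trick is to make backpropagation possible. The approximate posterior $q_\phi(z|x)$ is Gaussian, and its mean and standard deviation can be computed by the neural network; thus, each component of the variational lower bound can be calculated directly, and the model structure can be determined.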
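The two ingredients of this subsection, sampling through an auxiliary noise variable and the closed-form KL term for a diagonal Gaussian, can be sketched in a few lines. The code below is a sketch using only the standard library rather than a training implementation; it assumes the encoder outputs a mean vector `mu` and a log-variance vector `log_var`, and both names are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise.

    All randomness comes from eps, so z is a deterministic (and differentiable)
    function of mu and log_var once eps is fixed.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_diag_gaussian(mu, log_var):
    """Closed-form D_KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)                      # fixed seed for repeatability
mu = [1.0, -2.0]
log_var = [0.0, math.log(0.25)]             # variances 1.0 and 0.25

samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
sample_mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]

print(sample_mean)                          # close to mu
print(kl_diag_gaussian(mu, log_var))        # positive
print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))  # zero when q equals the prior
```

In an automatic-differentiation framework the same expression for z lets gradients flow to `mu` and `log_var`, which is the point of the trick; this plain-Python version only illustrates the sampling and the KL formula.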

2) DISENTANGLEMENT AND REPRESENTATION LEARNING
Although our world is inundated with data, a large part of the data is still unlabeled and unorganized. One of the challenges of artificial intelligence is to learn useful representations using unsupervised learning methods. The performance of models can be improved by selecting different representations to adjust the difficulty of the learning task [27]. Feature engineering [28], which transforms raw data into higher-level training representations, is one of the methods that can refine representations from raw data. However, manually selected features rely on human and domain knowledge, which makes feature engineering one of the most time-consuming and labor-intensive parts of machine learning, and its weakness is the inability to extract and organize discriminative information from the data automatically. Therefore, to broaden the scope and ease of use of artificial intelligence, we need to reduce the dependence on manual feature engineering. Representation learning can learn useful disentangled representations automatically.
Representation learning is done by the meta-priors proposed by Bengio et al. [7]. The goal of representation learning is to be useful for downstream tasks. At present, research on successful representation learning includes speech recognition [29], signal processing [30], object recognition [31], and natural language processing [32]. The most important metaprior is called ''disentanglement'' which is an unsupervised learning technique that breaks down, or disentangles, each feature into narrowly defined variables and encodes them as separate dimensions [7]. Assuming that the data is generated from independent factors of variation, and if the VAE is trained to reconstruct the sample well, then the latent space between the encoder and decoder keeps the important information of the original data.
Intuitively, a factorial code disentangles the individual elements that were originally mixed in the sample, just as humans recognize complex things by disentangling independent elements. If the dimensions of the latent vector are independent of each other, it is factorial disentangled, i.e., a good representation.

A. InfoVAE
Regularization of the encoding distribution is often used to encourage disentangled representations of the latent variables z. The fundamental approach taken in recent research on disentanglement is to augment the VAE loss with regularizers, for example by reweighting the ELBO. InfoVAE [33], also known as MMD-VAE, is a variant of VAEs that can lead to improved unsupervised representation learning through regularization based on the maximum mean discrepancy between distributions. The goal of InfoVAE is to perform representation learning by encouraging large mutual information between Z and X while adding a maximum mean discrepancy regularizer. The maximum mean discrepancy (MMD) [34] was first proposed for the two-sample test problem, which determines whether two distributions p and q are the same. Its basic idea is to define a class of functions F to measure the disparity between p and q. If the samples generated by the two distributions have equal means under the functions in F, then the two distributions are regarded as similar.
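A sample-based MMD estimate is short enough to write out. The sketch below is illustrative and not the InfoVAE training code: it assumes the function class F is induced by a Gaussian kernel with a hypothetical bandwidth of 1.0, and it uses the simple biased estimator of the squared MMD between two sets of latent codes.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors; the bandwidth is an assumed default."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

codes = [(-1.0, 0.5), (0.0, 0.0), (0.5, -0.5), (1.0, 1.0)]
shifted = [(a + 3.0, b + 3.0) for a, b in codes]   # same shape, different location

print(mmd2(codes, codes))     # zero up to rounding: identical samples
print(mmd2(codes, shifted))   # clearly positive: the samples differ
```

Used as a regularizer, `xs` would be codes produced by the encoder and `ys` samples drawn from the prior, so that minimizing the estimate pulls the aggregate code distribution toward the prior.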
The MMD is taken as a test statistic to determine whether the two distributions are similar. Such MMD-based regularization can lead to a disentangled latent representation, resulting in the following modified form of the objective function [35]: where R1 and R2 are regularizers and λ 1 , λ 2 > 0 are the corresponding hyperparameter weights. The MMD-VAE starts from an alternative way of writing L VAE : Zhao et al. [33] suggest encouraging large mutual information between z ∼ q φ (z|x) and x by adding a regularizer I q φ (x,z) to the formula above and reweighting the first term, resulting in the final objective: Figure 3 shows the architecture of the convolutional neural network (CNN) VAE model for the MNIST dataset that has been used for MMD-VAE. This structure is based on deep convolutional networks and includes fully connected (FC) layers and convolutional (Conv) layers. The input image of the encoder network has size 28 × 28 × 1 and passes through two Conv layers and a final FC layer until the latent variable space is reached. The two convolutional layers in the encoder network reduce the dimensionality of the feature maps using a stride of 2 and a kernel size of 4 × 4. The two parallel feature vectors obtained by flattening the feature map of the second convolutional layer are µ and σ 2 , respectively. For a general implementation, the number of neurons in the fully connected layer is a modeling decision and represents the dimension of the latent space.
The generative model p (x|z) takes the latent variables z sampled from the distribution parameterized by µ and σ 2 using the reparameterization trick and feeds them through two FC layers and one Conv layer until a reconstructed output is obtained. The FC layers in the decoder reshape the latent variable z to 7 × 7 × 128, and a deconvolution with a stride of 2 and a kernel size of 4 × 4 finally produces the reconstructed image.
Disentanglement quality of inference models is typically evaluated based on the ground-truth factors of variation (if available). Specifically, disentanglement metrics measure how predictive the individual latent factors are of the ground-truth factors [36]. When different models were compared on performance, stability, and training speed, and possible types of divergences were evaluated, InfoVAE with MMD regularization showed better performance metrics and greater stability than the traditional VAE [33]. InfoVAE-MMD provides a good way to handle the problem of the latent code being ignored [33]. However, its drawbacks include often-blurred image generation, as shown in Figure 4, which presents samples generated by InfoVAE from MNIST data.

B. β-VAE
Another unsupervised method that can automatically discover disentangled factors in the latent variable space based on the VAE framework is β-VAE [36]. The basic principle of β-VAE is to reweight the ELBO of the model with an additional parameter β as the weight of the D KL term. The ELBO can be expressed as:
$$L(\theta, \phi; x, \beta) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{KL}[q_\phi(z|x) \,\|\, p(z)]$$
This constraint limits the capacity of the latent information channel and emphasizes learning statistically independent latent factors. Combining the maximum likelihood objective with the generative model allows the model to obtain the most useful latent features of the input data. If the data was generated by some independent dimensions of variation, they will be disentangled.
Compared to the unmodified VAE framework, this simple modification allows β-VAE to substantially improve disentanglement in the learned representation [35]: where λ 1 = β − 1 > 0 is the corresponding weight. This regularization causes q φ (z|x) to better match the prior p(z), which in turn restricts the implicit capacity of the latent feature z ∼ q φ (z|x) and causes it to be disentangled. Note that the β-VAE with β = 1 is equivalent to a standard VAE. β-VAE achieves a disentangled representation by selecting an appropriate hyperparameter β. This simple penalty has proven able to produce models with a high degree of disentanglement. However, it is not explicitly stated why placing this penalty on KL(q(z|x)||p(z)) with a factorized prior helps in encoding latent variables with a disentangled representation of the data. Recently, the authors in [37] found that the ELBO has a decomposition that can be used to explain the success of β-VAE in learning disentangled representations. Specifically, the total correlation (TC) penalty in the loss function encourages the model to find statistically independent factors in the data distribution. In information theory, TC is a generalization of mutual information and is the amount of information shared among a set of variables. It is also referred to as multi-information or multivariate constraint. TC quantifies the dependency or redundancy among a group of random variables. In β-VAE, the TC penalty forces the model to find statistically independent factors in the data distribution. This leads to the learning of latent variables that exhibit a disentangled transformation of all data samples, and thus the presence of this term is the reason for the success of β-VAE.
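In code, the β-VAE change amounts to a single multiplicative weight on the KL term. The sketch below assumes the loss is written as a negative reweighted ELBO to be minimized, a diagonal Gaussian posterior with a standard normal prior, and a reconstruction negative log-likelihood `recon_nll` computed elsewhere; the default β = 4.0 and the example numbers are hypothetical.

```python
import math

def kl_diag_gaussian(mu, log_var):
    """Closed-form D_KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_nll, mu, log_var, beta=4.0):
    """Negative reweighted ELBO: reconstruction NLL plus beta times the KL term."""
    return recon_nll + beta * kl_diag_gaussian(mu, log_var)

# Hypothetical per-example values standing in for network outputs.
recon_nll = 10.0
mu, log_var = [0.5, -0.5], [0.1, -0.2]

standard = beta_vae_loss(recon_nll, mu, log_var, beta=1.0)   # plain VAE loss
stronger = beta_vae_loss(recon_nll, mu, log_var, beta=4.0)   # heavier KL penalty

print(standard, stronger)
```

Writing the loss this way makes the equivalence with the standard VAE at β = 1 explicit, and shows that the extra penalty relative to the standard loss is (β − 1) times the KL term.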
The β-VAE is related to InfoVAE because the InfoVAE family generalizes β-VAEs [33]. β-VAE can be obtained from InfoVAE by setting λ 2 in equation (18) to 0. The disadvantage of β-VAE relative to InfoVAE is that the β-VAE model cannot effectively penalize the weights and information preferences of X and Z, resulting in underfitting or ignoring the latent variables. Specifically, for each λ, InfoVAE can choose a unique value. If we choose a larger value of λ ≥ 1 to balance the importance of the observation space X and the latent space Z, we must also choose α ≤ 0, which forces the model to penalize mutual information, thus avoiding under-fitting or ignoring the latent variables.
When the ground-truth generative factors are known, disentanglement performance is evaluated with a supervised, classifier-based metric. Overall, β-VAE tends to find more latent factors consistently and learns a cleaner disentangled representation (as shown in Figure 5) [38]. In addition, β-VAE does not require an assumption about the distribution of the data, and its training procedure is very stable.

C. VQ-VAE
In machine learning, in addition to learning based on continuous features [22], [39]- [41], there is also learning based on discrete representations [23], [42]- [44]. Discrete representations are naturally suitable for complex reasoning, planning, and predictive learning. Although the use of discrete latent variables in deep learning has proven challenging, powerful autoregressive models have been developed for modeling distributions on discrete variables [45].
The main purpose of VQ-VAE is to learn discrete latent variables. VQ-VAE is implemented using a vector quantization (VQ) algorithm. Quantization can be divided into scalar quantization and vector quantization. Scalar quantization samples signal values and quantizes them one by one. Vector quantization groups several sampled signals together and quantizes them jointly, thus reducing the amount of data. Specifically, for each latent variable, we look for points within a certain range around it to represent it, so that we can treat the latent variables as k-dimensional vectors. Vector quantization is an extremely important method of signal compression, which is widely used in speech coding, speech recognition and synthesis, image compression, and other fields.
In VQ-VAE, each latent embedding vector e i is a vector in a d-dimensional latent space, and the k such vectors that make up the discrete latent space are learnt together with the rest of the model parameters (as shown in Figure 6).
The posterior q φ (z|x) is implemented as one-hot vectors:
$$q_\phi(z = k \mid x) = \begin{cases} 1 & \text{for } k = \arg\min_j \| z_e(x) - e_j \|_2 \\ 0 & \text{otherwise} \end{cases}$$
where z e (x) is the output of the encoder, and the embeddings e k can be learned individually for each latent variable z j . The VQ-VAE sampling procedure is based on an autoregressive distribution [45]. In an autoregressive model, the target variable is predicted from a combination of its own previous values. After training, an autoregressive distribution p (z) is fitted over z so that X can be generated by ancestral sampling.
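The nearest-neighbour lookup behind the one-hot posterior can be sketched as follows. The codebook values and the encoder output are made up for illustration, and the sketch covers only the quantization step, not the training of the embeddings or the autoregressive prior.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, embedding) of the codebook vector closest to the encoder output z_e."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]

def one_hot(k, size):
    """One-hot vector representing the deterministic posterior over codebook indices."""
    return [1 if j == k else 0 for j in range(size)]

# Hypothetical codebook with K = 3 embeddings in a d = 2 dimensional latent space.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
z_e = (0.8, 1.3)                     # hypothetical encoder output

k, z_q = quantize(z_e, codebook)
print(k, z_q, one_hot(k, len(codebook)))
```

The decoder then receives the selected embedding `z_q` instead of the continuous encoder output, which is what makes the latent representation discrete.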
VQ-VAE can achieve good reconstructions [45] (Figure 7) compared to conventional VAEs. In addition, images contain a lot of redundant information because most pixels are correlated and noisy, so a pixel-level learning model can be wasteful. When applied to speech data, VQ-VAE learns a basic phoneme-level speech model in a fully unsupervised manner, which can be used for controlled speech generation and phoneme classification [46].
Representations with discrete latent variables are greatly reduced in size by discrete coding, and reconstructions appear slightly blurry compared to the original input. However, in some well-trained VQ-VAEs (i.e., high-entropy), parts of the codebook may be lost. The model then suffers from codebook collapse and no longer uses the full capacity of the discrete bottleneck, leading to worse likelihoods and poor reconstructions. The reason for this phenomenon is not clear, although the K-means and Gaussian mixture model algorithms may have similar problems [47].

D. CLUSTERING VAE
Cluster analysis is an unsupervised learning method that learns from training samples without class labels and aims to reveal the intrinsic properties and regularities of the data. Mathematical methods are used to study the classification of given objects and the degree of closeness between categories. Specifically, cluster analysis divides the dataset into several subsets such that, under a chosen metric, the elements within each subset are more similar to one another than to elements of other subsets. The subsets obtained in this way are called ''clusters'', each of which represents a potential category. The distinction between classification and clustering is that classification first determines the categories and then divides the data, whereas clustering first divides the data and then determines the categories.
From a machine learning perspective, cluster analysis is an unsupervised learning method where the classes are not given in advance but are created according to the similarity and distance of the data [48]- [50]. The structure of the clusters is not presupposed, but the number of clusters can be proposed. The purpose of the clustering algorithm is to find potential natural grouping structures and relationships of interest in the data. Clustering has been widely used in various fields of engineering and science. In general, the clustering method is mainly the measurement of data's groups based on similarity or dissimilarity [51], which can be divided into direct and indirect methods. The direct method is based on the similarity clustering of the original input, mainly by measuring a certain metric between the samples to achieve clustering. The indirect method applies the metrics on the features generated from the original data.
Recently, deep clustering has become one of the popular approaches to learning good representations. Deep embedded clustering (DEC) [52], among other methods, has helped make deep clustering a popular research field. DEC uses deep neural networks to learn the representations and then uses clustering algorithms to perform cluster analysis on the generated features. The data is usually mapped to the representation space and then fed directly into the clustering model. In order to generate meaningful data samples, generative models need to serve two purposes: one is to capture the statistical structure of the data, and the other is to generate data samples. DEC performs well in clustering; however, in some cases it models the generative process of the data poorly, so it does not generate good-quality samples. Therefore, there is a need to develop a better deep clustering model which: 1) learns to capture a good representation of the statistical structure of the data, and 2) is able to generate samples.

E. VARIATIONAL DEEP EMBEDDING (VaDE)
The Variational Deep Embedding (VaDE) [53] is one of the techniques used for both data clustering and generation. It is an extension of the variational autoencoder that applies a Gaussian Mixture Model (GMM) [54] to the latent variables for clustering purposes (as shown in Figure 8). The GMM defines the probability density function as a weighted sum of multiple Gaussian densities. The most common ways to estimate GMM parameters are Maximum Likelihood Estimation and Expectation Maximization (EM) [55]. The benefit of the GMM is that it can generate samples by estimating the data density.
The generative model for VaDE can be formulated as [56]: where c ∈ [1, K] follows the distribution of the weights of the Gaussian terms in the GMM (parametrized by π), K is the predetermined number of classes, and µ and σ 2 are the parameters of the elements in the clusters. Ber(x|µ x ) and N (x|µ x , σ 2 x I ) are the multivariate Bernoulli distribution and the Gaussian distribution parameterized by µ and σ 2 . The encoder model can be stated as: VaDE maximizes the evidence lower bound (ELBO) obtained using Jensen's inequality: where q (z, c|x) is the probability that the observed variable x belongs to class c. The first term in L ELBO is the reconstruction loss L n , and the second term is the clustering loss L c , which is the Kullback-Leibler divergence between the distribution of the observed sample and the Mixture of Gaussians (MoG) prior. After training, the class can be inferred from the MoG latent space. VaDE is an unsupervised clustering model. The number of clusters of a VaDE can be set to the number of classes in each dataset, or a different number of clusters K can be selected. If K is less than the total number of classes in the dataset, digits with similar appearances will be grouped together.
On the other hand, if K is greater than the number of classes, some numbers with the same appearance will be divided into subclasses.
Samples of generated digits from the MNIST dataset are shown in Figure 9.
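Inferring a class from a Mixture-of-Gaussians latent space comes down to computing, for a latent code z, the posterior probability of each mixture component. The sketch below shows the standard GMM cluster posterior, proportional to the mixture weight times the component density, for diagonal covariances. Treating this as the inference step after training is an assumption of the sketch, and the mixture parameters are made up for illustration.

```python
import math

def log_diag_gaussian(z, mean, var):
    """Log density of a diagonal Gaussian N(z | mean, diag(var))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v)
               for zi, m, v in zip(z, mean, var))

def cluster_posterior(z, weights, means, variances):
    """Posterior over components: proportional to weight_c * N(z | mean_c, var_c)."""
    logs = [math.log(w) + log_diag_gaussian(z, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)                                # subtract the max for numerical stability
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical mixture with K = 2 clusters in a 2-D latent space.
weights = [0.5, 0.5]
means = [(-2.0, 0.0), (2.0, 0.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]

z = (1.5, 0.2)                                     # hypothetical latent code
probs = cluster_posterior(z, weights, means, variances)
predicted = max(range(len(probs)), key=lambda c: probs[c])

print(probs, predicted)
```

Choosing K smaller or larger than the number of classes changes only how many components compete in this posterior, which is why similar digits merge in the first case and split into subclasses in the second.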

F. GAUSSIAN MIXTURE VAE (GMVAE)
Although VaDE is simple and performs GMM clustering on the latent space, it cannot be considered a real GMM for data generation because it uses independent Gaussian distributions as the prior. However, because the best choice of prior distribution is one that can describe the distribution of the clustered latent structure, the authors in [57] proposed that the prior distribution p (z) be a GMM that depends on two further latent spaces, w and c. This approach has advantages over the regular VAE because the GMM can capture clustering representations of data that are not necessarily unimodal.
GMVAE uses a Gaussian mixture model as the prior for the latent encoding space and defines a generative procedure that leads to a variational Bayes optimization objective (as shown in Figure 10). It assumes that the sample is generated by a Gaussian mixture and can infer the class of data points from the latent variable spaces. After optimizing the ELBO, the learned GMM model can infer the cluster assignment from the latent spaces.
The GMVAE algorithm can cluster the given data and generate images, but because of the overhead of the extra latent variables, it typically has higher computational complexity than other deep clustering techniques. The generative model for the GMVAE can be expressed as [56]:

$$p(x, z, w, c) = p(w)\, p(c)\, p(z \mid w, c)\, p(x \mid z),$$
$$w \sim \mathcal{N}(0, I), \qquad c \sim \mathrm{Mult}(\pi),$$
$$z \mid c, w \sim \prod_{k=1}^{K} \mathcal{N}\big(\mu_{c_k}(w), \mathrm{diag}(\sigma_{c_k}^2(w))\big)^{c_k},$$
$$x \mid z \sim \mathcal{N}\big(\mu(z), \mathrm{diag}(\sigma^2(z))\big) \quad \text{or} \quad \mathcal{B}(\mu(z)),$$

where K is the number of clusters, w is the regular latent variable, c is the label latent variable, z is the GMM latent variable, and x is the generated data. For the recognition (cluster inference) step, the trained networks (Encoder 1 (E1), Encoder 2 (E2), and Network 2 (N2)) approximate the posterior distribution $q(z, w, c \mid x)$, which can be factorized over the network parameters as:

$$q(z, w, c \mid x) = \prod_{i} q_{E_1}(w_i \mid x_i)\, q_{E_2}(z_i \mid x_i)\, p_{N_2}(c_i \mid z_i, w_i),$$

where i is the index over the training data, $E_1$ produces the latent space w, $E_2$ generates the latent space z, and $N_2$ is the classification network, as shown in Figure 10. The training loss for the GMVAE is the negative of the ELBO, which can be expressed as:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(w \mid x) \,\|\, p(w)) - \mathbb{E}_{q(w \mid x)\, p(c \mid z, w)}\big[D_{KL}(q(z \mid x) \,\|\, p(z \mid w, c))\big] - \mathbb{E}_{q(z \mid x)\, q(w \mid x)}\big[D_{KL}(p(c \mid z, w) \,\|\, p(c))\big],$$

where the terms are, respectively, the reconstruction term, the w-prior term, the conditional prior term, and the c-prior term. Comparing the loss function of the GMVAE with that of the VaDE, it can be seen that the VaDE is slightly less complex because there is no need to sample an additional w. As seen in Figure 11, images generated by the GMVAE have better quality than the ones generated by the VaDE algorithm.

G. VAE-GAN
VAE-GAN [58] combines VAEs and GANs into an unsupervised generative model. VAE-GAN transforms the image features learned by the discriminator into the reconstruction error of the VAE. The basic idea of this model is to improve the fidelity of the output of VAEs. Since images generated by a VAE are usually blurred, the GAN component helps ensure the realism of the generated images. VAE-GAN is built on the VAE structure with a GAN discriminator [59] added after the decoder to ensure that the samples generated by the VAE have high quality (as shown in Figure 12).
The objective of VAE-GAN is to minimize a loss function L that is composed of the VAE components and the GAN components:

$$\mathcal{L} = \mathcal{L}_{prior} + \mathcal{L}_{llike}^{D_l} + \mathcal{L}_{GAN}, \qquad (28)$$

where $\mathcal{L}_{prior}$ is the KL divergence between the latent distribution $q(z \mid x)$ of the VAE and the prior:

$$\mathcal{L}_{prior} = D_{KL}(q(z \mid x) \,\|\, p(z)).$$

The second term, $\mathcal{L}_{llike}^{D_l}$, is the reconstruction loss. It replaces the typical VAE reconstruction loss (expected log likelihood) with a reconstruction error expressed in the GAN discriminator. $D_l(x)$ denotes the hidden representation of the l-th layer of the discriminator. A Gaussian observation model for $D_l(x)$ with mean $D_l(x')$ and identity covariance is therefore introduced:

$$p(D_l(x) \mid z) = \mathcal{N}(D_l(x) \mid D_l(x'), I),$$

where $x' \sim \mathrm{Decoder}(z)$ is the sample from the generative model of x. Thus the reconstruction loss becomes:

$$\mathcal{L}_{llike}^{D_l} = -\mathbb{E}_{q(z \mid x)}[\log p(D_l(x) \mid z)].$$

The above equations assume that the outputs of the l-th layer of the discriminator differ in a Gaussian manner. Thus, the mean squared error (MSE) between the l-th layer outputs for x and x' gives the VAE's reconstruction loss. The third term in equation (28) is the loss of the GAN part of VAE-GAN. The goal of a conventional GAN is to find a binary classifier that distinguishes between real and generated data while encouraging the generator to fit the real data distribution, i.e., the traditional GAN loss is defined as:

$$\mathcal{L}_{GAN} = \log(\mathrm{Dis}(x)) + \log(1 - \mathrm{Dis}(\mathrm{Gen}(z))) \qquad (32)$$

However, since the GAN in VAE-GAN also receives input from the encoder $q(z \mid x)$, the GAN loss becomes:

$$\mathcal{L}_{GAN} = \log(\mathrm{Dis}(x)) + \log(1 - \mathrm{Dis}(\mathrm{Gen}(z))) + \log(1 - \mathrm{Dis}(\mathrm{Dec}(\mathrm{Enc}(x)))) \qquad (33)$$

Since VAE-GAN combines the VAE and the GAN, it performs well in image synthesis, effectively overcoming the blurriness produced by regular VAEs (as shown in Figure 13).
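As a rough numerical illustration of the terms discussed above, the following sketch evaluates the feature-wise reconstruction error and the GAN term of equation (33) for given discriminator outputs. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, the discriminator outputs and layer features are passed in as plain numbers, and the question of which network minimizes or maximizes which term during training is deliberately left out.

```python
import math

def feature_reconstruction_loss(feat_real, feat_recon):
    """MSE between the l-th layer discriminator features of x and of its reconstruction x'."""
    return sum((a - b) ** 2 for a, b in zip(feat_real, feat_recon)) / len(feat_real)

def gan_loss(d_real, d_prior_sample, d_recon):
    """Equation (33): log Dis(x) + log(1 - Dis(Gen(z))) + log(1 - Dis(Dec(Enc(x))))."""
    return (
        math.log(d_real)
        + math.log(1.0 - d_prior_sample)
        + math.log(1.0 - d_recon)
    )

def vae_gan_loss(kl_prior, feat_real, feat_recon, d_real, d_prior_sample, d_recon):
    """Sum of the three terms: prior (KL) term, feature-wise reconstruction term, GAN term."""
    recon = feature_reconstruction_loss(feat_real, feat_recon)
    return kl_prior + recon + gan_loss(d_real, d_prior_sample, d_recon)

# Hypothetical discriminator outputs (probabilities) and layer-l features.
total = vae_gan_loss(
    kl_prior=0.8,
    feat_real=[1.0, 2.0],
    feat_recon=[1.0, 4.0],
    d_real=0.5,
    d_prior_sample=0.5,
    d_recon=0.5,
)
```

Measuring the reconstruction error in the discriminator's feature space rather than in pixel space is what lets the model penalize perceptual differences instead of per-pixel blur.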

H. F-VAEGAN-D2
The human visual system is superior to the spectral camera system most of the time due to its physical and physiological characteristics. These great features are built on at least two foundations. The first is the brain: about half of our brain is directly involved in the processing of visual information [60]. Second, basic visual skills are learned in a long process that runs through the first few years of life [61]. For example, newborns can distinguish certain patterns based on statistical features such as space or contour. Infants can notice simple rough geometric relationships, and they do not always focus on contours and shapes. At about two years old, children begin to discover higher-order geometric relationships. Here, the term ''visual feature learning'' refers to basic features (e.g., color, shape) and non-basic features (e.g., different directions).
In deep learning, owing to their powerful ability to learn general visual features at different levels, deep neural networks have been used as the basic structure for many visual feature learning tasks in computer vision, such as object detection [62]- [64] and semantic segmentation [65]- [67]. Among deep learning models, convolutional neural networks with complex architectures trained on large-scale data sets, such as AlexNet [68], VGG [69], GoogLeNet [70], ResNet [71], and DenseNet [72], have repeatedly advanced the state of the art in many visual feature learning tasks [73]- [77] in computer vision. They learn visual features of images through a CNN and rely on pairs of image features and class attributes. However, collecting and annotating large-scale data sets is time-consuming and expensive. To avoid this cost, many recent studies have proposed unsupervised learning methods that can learn CNN visual features from large-scale unseen images without using any annotation, such as zero-shot/one-shot/few-shot learning.
Zero-shot learning refers to the setting in which no training samples are available for some classes. An important theoretical basis of zero-shot learning is to use high-dimensional semantic features instead of low-dimensional features of samples, so that the trained model is transferable. Most recent zero-shot learning works [78]- [81] learn a compatibility function between the image and semantic embedding spaces. Few-shot/one-shot learning refers to learning from small samples. Its purpose is to overcome the need for massive amounts of training data in machine learning, with the expectation that enough knowledge can be obtained from a small amount of data. The general approach is to train the model on classes with sufficient training samples and generalize to classes with few samples without learning new parameters [82]- [86]. However, it is suspected that the features generated by small-sample learning cannot represent complex features well. f-VAEGAN-D2 [87] generates sufficient visual features for use in any-shot learning. The goal is to infer rich features from limited data samples, i.e., to generate rich features from zero shots (unseen pictures) to few shots (only a few pictures per class) to many shots (each class has many pictures). Figure 14 shows the architecture of the f-VAEGAN-D2 model. It proposes to enhance the feature generator by combining VAEs and GANs with a shared decoder and generator, and by adding another discriminator to distinguish real from generated features of unseen samples. To train the VAE section, ResNet-101 is fed with a labeled image and outputs a 2048-dimensional embedding $x_s$. This is fed to the Encoder, which generates the latent variable z. The class label of the sample is appended to this latent variable, and the result is fed to the Decoder/Generator. The reconstructed embedding $\tilde{x}_s$ produced by the Decoder/Generator is compared with $x_s$ to obtain the loss for the VAE.
The WGAN training samples $z_p \sim \mathcal{N}(0, 1)$ from the latent space and concatenates a class label in the same manner as in the VAE section. This is fed into the Decoder/Generator to create fake embeddings ($\tilde{x}_s$). The fake embedding, along with $x_s$, is used by the discriminator to distinguish between real and fake data.
Unseen images are processed by ResNet-101 to produce an unseen embedding $x_u$, which is used together with a generated unseen embedding $\tilde{x}_u$ in Discriminator 2. This discriminator helps train the model to recognize unseen categories, which makes zero-shot learning possible.
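The final objective function of the f-VAEGAN-D2 network can be stated as follows:

$$\min_{G, E} \max_{D_1, D_2} \; \mathcal{L}^{s}_{VAEGAN} + \mathcal{L}^{n}_{WGAN}, \qquad (34)$$

where E is the VAE encoder, G is both the VAE decoder and the WGAN generator, and $D_1$ and $D_2$ are the discriminators for the seen and unseen groups respectively. $\mathcal{L}^{s}_{VAEGAN}$ is the loss function of the VAEGAN for the seen samples and can be formulated as:

$$\mathcal{L}^{s}_{VAEGAN} = \mathcal{L}^{s}_{VAE} + \gamma\, \mathcal{L}^{s}_{WGAN},$$

where γ is a hyperparameter that controls the relative weight of the VAE loss $\mathcal{L}^{s}_{VAE}$ and the WGAN loss $\mathcal{L}^{s}_{WGAN}$. These loss functions can be expressed as:

$$\mathcal{L}^{s}_{VAE} = D_{KL}(q(z \mid x, c) \,\|\, p(z \mid c)) - \mathbb{E}_{q(z \mid x, c)}[\log p(x \mid z, c)], \qquad (36)$$

which is similar to the original VAE loss with the addition of the class embedding variable c, and the WGAN loss function can be expressed as:

$$\mathcal{L}^{s}_{WGAN} = \mathbb{E}[D_1(x, c)] - \mathbb{E}[D_1(\tilde{x}, c)] - \lambda\, \mathbb{E}\big[(\|\nabla_{\hat{x}} D_1(\hat{x}, c)\|_2 - 1)^2\big],$$

where $\tilde{x} = G(z_p, c)$, $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$ with $\alpha \sim U(0, 1)$, and λ is the penalty coefficient. Finally, the unseen WGAN loss function $\mathcal{L}^{n}_{WGAN}$ can be stated as:

$$\mathcal{L}^{n}_{WGAN} = \mathbb{E}[D_2(x_n)] - \mathbb{E}[D_2(\tilde{x}_n)] - \lambda\, \mathbb{E}\big[(\|\nabla_{\hat{x}_n} D_2(\hat{x}_n)\|_2 - 1)^2\big],$$

where again $\tilde{x}_n = G(z_p, c_n)$, $\hat{x}_n = \alpha x_n + (1 - \alpha)\tilde{x}_n$ with $\alpha \sim U(0, 1)$, and λ is the penalty coefficient.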
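The interpolation between a real and a generated embedding with α ∼ U(0, 1), together with the penalty coefficient λ, corresponds to the usual WGAN gradient-penalty construction. The sketch below evaluates a single-sample estimate of such a loss, D(x) − D(x̃) − λ(‖∇D(x̂)‖₂ − 1)², under several simplifying assumptions: the critic is an arbitrary Python function (hypothetical, and not conditioned on the class embedding c), the gradient is approximated by central finite differences instead of automatic differentiation, and λ = 10 is only an example value.

```python
import math

def interpolate(x_real, x_fake, alpha):
    """x_hat = alpha * x + (1 - alpha) * x_fake, the point where the gradient is penalized."""
    return [alpha * r + (1.0 - alpha) * f for r, f in zip(x_real, x_fake)]

def numerical_gradient(critic, x, eps=1e-5):
    """Central finite-difference approximation of the critic's gradient at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((critic(up) - critic(down)) / (2.0 * eps))
    return grad

def gradient_penalty(critic, x_real, x_fake, alpha, lam):
    """lam * (||grad critic(x_hat)||_2 - 1)^2 for one interpolated sample."""
    x_hat = interpolate(x_real, x_fake, alpha)
    grad = numerical_gradient(critic, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2

def wgan_loss(critic, x_real, x_fake, alpha, lam=10.0):
    """Single-sample estimate of D(x) - D(x_fake) - gradient penalty."""
    return critic(x_real) - critic(x_fake) - gradient_penalty(
        critic, x_real, x_fake, alpha, lam
    )

# Hypothetical linear critic whose gradient has unit norm, so the penalty vanishes.
def linear_critic(x):
    return 0.6 * x[0] + 0.8 * x[1]

loss = wgan_loss(linear_critic, x_real=[1.0, 2.0], x_fake=[0.0, 0.0], alpha=0.3)
```

With a real neural critic, the gradient norm generally differs from one and the penalty term pushes the critic towards being 1-Lipschitz along the line between real and generated embeddings.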

I. ZERO-VAE-GAN
Zero-shot learning (ZSL) is a challenging task due to the lack of unseen class data during training. Existing works attempt to establish a mapping between the visual and class spaces through a common intermediate semantic space. The main limitation of existing methods is a strong bias towards seen classes, known as the domain shift problem. This leads to unsatisfactory performance in both conventional and generalized ZSL tasks. Zero-VAE-GAN [88] tackles this challenge by converting ZSL into a conventional supervised learning problem by generating features for unseen classes. Zero-VAE-GAN is a joint generative model that couples a variational autoencoder (VAE) and a generative adversarial network (GAN). The main ideas of this model are to 1) generate more CNN features for seen classes and 2) generate labeled CNN features for unseen classes. The Zero-VAE-GAN model consists of four components: 1) Encoder E and 2) Generator G: by combining two generative models, the model is capable of synthesizing high-quality features, 3) Discriminator D: for discriminating real features from fake generated features, 4) Categorizer C: a classifier that helps the model generate more discriminative features for the classification task. The generator G and the discriminator D learn the distribution of features through a two-player minimax competition. G tries to minimize the following loss: (39) where $x \sim p(x)$, $s \sim p(s)$ and $z' \sim \mathcal{N}(0, 1)$; p(x) and p(s) denote the prior distributions of real features and semantic embeddings, respectively. $z = E(x, s) \in \mathbb{R}^d$ denotes the d-dimensional latent representation generated by the encoder E. In contrast to z, $z' \in \mathbb{R}^d$ is an arbitrary representation drawn from a Gaussian distribution, which is used as the input for the GAN along with the semantic embedding.
On the other hand, D tries to minimize the following loss: (40) Unlike f-VAEGAN-D2, Zero-VAE-GAN uses feedback classification probabilities produced by pretrained multilayer perceptron (MLP) or k-nearest neighbor classifiers to generate pseudo-labels for the real unseen CNN features. These classifiers are trained on the synthesized (fake) data produced by the generator G trained in the first step. The classifiers' pseudo-label probabilities are used for self-training refinement of the generator G to improve the generation of features for unseen data.
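The pseudo-labeling step can be illustrated with a small sketch. The source only states that classification probabilities are used to produce pseudo-labels for unseen features; the confidence threshold used below to decide which samples are kept, as well as the function name, are assumptions made for illustration.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, class_index) pairs whose top class probability reaches the threshold."""
    selected = []
    for index, probs in enumerate(probabilities):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected

# Hypothetical classifier outputs for three unseen-class feature vectors.
probs = [
    [0.95, 0.03, 0.02],  # confident: kept with pseudo-label 0
    [0.40, 0.35, 0.25],  # ambiguous: skipped
    [0.05, 0.05, 0.90],  # confident: kept with pseudo-label 2
]
pseudo = select_pseudo_labels(probs)
```

The selected pairs would then be treated as labeled unseen features when the generator is refined in the self-training step.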

J. HYPERSPHERICAL VAE
One way to improve any Bayesian model is to change the prior distribution based on the data [89]. The prior distribution does not need to have an objective basis, so it can be based in part or completely on subjective beliefs. Further, an ideal latent space should separate the clusters of each class [90]. However, in normal VAEs the Gaussian prior imposes limitations on the latent space, e.g., it leads to improper clustering in high-dimensional data and cannot effectively represent directional data such as that arising from protein structure [91]. Therefore, to improve the clusters in the latent space in high dimensions and learn useful representations of directional data, the Gaussian prior needs to be replaced with a prior that separates the classes over the entire latent space. One solution is to use the von Mises-Fisher (vMF) distribution as the prior.
The vMF distribution [92] is a continuous probability distribution on the hypersphere; its special case on the circle is also called the circular normal distribution. Some views regard it as an approximation to the wrapped normal distribution, as it is a circular analogue of the normal distribution. It plays the role of the normal distribution in hyperspherical space [93]. Figure 15 shows sets of points sampled from vMF distributions on the 3D sphere. Hyperspherical VAE (S-VAE) [94] uses the vMF distribution as an alternative to the Gaussian distribution. This replacement leads to a hyperspherical latent space as opposed to a hyperplanar one, where the uniform distribution on the hypersphere is conveniently recovered as a special case of the vMF. Let $z \in \mathbb{R}^m$; then the vMF distribution of the latent variable can be defined as:

$$q(z \mid \mu, \kappa) = C_m(\kappa) \exp(\kappa \mu^{T} z),$$

where the parameters µ and κ are called the mean direction and the concentration parameter, respectively. The greater the value of κ, the higher the concentration of the distribution around the mean direction µ. The distribution is unimodal for κ > 0 and is uniform on the sphere for κ = 0, where $\|\mu\|_2 = 1$. $C_m(\kappa)$ is the normalization constant and is equal to

$$C_m(\kappa) = \frac{\kappa^{m/2 - 1}}{(2\pi)^{m/2}\, \mathcal{I}_{m/2 - 1}(\kappa)},$$

where $\mathcal{I}_{v}(\kappa)$ is the modified Bessel function of the first kind of order v. The authors use a special case of the KL divergence in which a uniform prior is placed on the latent space.
Given the Gamma function Γ, the surface area of the Stiefel manifold with radius r is

$$S = \frac{2\pi^{m/2}}{\Gamma(m/2)}\, r^{m-1}.$$

The von Mises-Fisher distribution lives on the Stiefel manifold with radius r = 1, whose area is simply the surface area of the hypersphere of radius 1. Thus, the uniform distribution, which is the vMF with κ = 0, is

$$U(S^{m-1}) = \left(\frac{2\pi^{m/2}}{\Gamma(m/2)}\right)^{-1}.$$

In the derivation of the KL divergence to the uniform distribution, the posterior is $q(z \mid \mu, \kappa) = \mathrm{vMF}(\mu, \kappa)$ and the prior is $p(z) = U(S^{m-1})$. Finally, the KL divergence term $KL(\mathrm{vMF}(\mu, \kappa) \,\|\, U(S^{m-1}))$ to be optimized is:

$$KL(\mathrm{vMF}(\mu, \kappa) \,\|\, U(S^{m-1})) = \kappa\, \frac{\mathcal{I}_{m/2}(\kappa)}{\mathcal{I}_{m/2-1}(\kappa)} + \log C_m(\kappa) - \log\left(\frac{2\pi^{m/2}}{\Gamma(m/2)}\right)^{-1}.$$

Since the KL term does not depend on µ, this parameter is only optimized in the reconstruction term. One difficulty is that the modified Bessel function in $C_m(\kappa)$ in the above expression cannot be handled by automatic differentiation packages. Thus, to optimize this term, the gradient is derived with respect to the concentration parameter κ:

$$\nabla_{\kappa} KL(\mathrm{vMF}(\mu, \kappa) \,\|\, U(S^{m-1})) = \frac{1}{2}\, \kappa \left( \frac{\mathcal{I}_{m/2+1}(\kappa)}{\mathcal{I}_{m/2-1}(\kappa)} - \frac{\mathcal{I}_{m/2}(\kappa)\,\big(\mathcal{I}_{m/2-2}(\kappa) + \mathcal{I}_{m/2}(\kappa)\big)}{\mathcal{I}_{m/2-1}(\kappa)^{2}} + 1 \right).$$

In the S-VAE all digits occupy the entire space. S-VAE is naturally suited to capturing data with a hyperspherical latent structure, and it outperforms a normal VAE in low dimensions. However, the available spherical surface area can be limited and may collapse in higher dimensions. Figure 16 shows the visualization of the latent space representation of MNIST for S-VAE.
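To make the KL term between a vMF posterior and the uniform prior on the sphere concrete, the sketch below evaluates KL = κ·I_{m/2}(κ)/I_{m/2−1}(κ) + log C_m(κ) + log(2π^{m/2}/Γ(m/2)) and its derivative with respect to κ using only the standard library. It is an illustrative sketch under stated assumptions: the Bessel functions are computed from their power series (adequate for moderate κ, not a production-grade routine), the function names are hypothetical, and the order m/2 − 2 is obtained from the recurrence I_{v−1}(κ) = I_{v+1}(κ) + (2v/κ) I_v(κ) so that no negative order has to be evaluated. It assumes κ > 0 and m ≥ 2.

```python
import math

def log_bessel_i(order, kappa, terms=200):
    """log I_order(kappa) from the series sum_k (kappa/2)^(2k+order) / (k! * Gamma(k+order+1))."""
    log_half = math.log(kappa / 2.0)
    log_terms = [
        (2 * k + order) * log_half - math.lgamma(k + 1) - math.lgamma(k + order + 1)
        for k in range(terms)
    ]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def kl_vmf_uniform(kappa, m):
    """KL( vMF(mu, kappa) || Uniform(S^(m-1)) ); independent of the mean direction mu."""
    v = m / 2.0 - 1.0
    log_iv = log_bessel_i(v, kappa)
    log_c = v * math.log(kappa) - (m / 2.0) * math.log(2.0 * math.pi) - log_iv
    ratio = math.exp(log_bessel_i(v + 1.0, kappa) - log_iv)
    log_area = math.log(2.0) + (m / 2.0) * math.log(math.pi) - math.lgamma(m / 2.0)
    return kappa * ratio + log_c + log_area

def grad_kl_vmf_uniform(kappa, m):
    """Derivative of the KL term with respect to the concentration kappa."""
    v = m / 2.0 - 1.0
    log_iv = log_bessel_i(v, kappa)
    r_plus1 = math.exp(log_bessel_i(v + 1.0, kappa) - log_iv)  # I_{m/2} / I_{m/2-1}
    r_plus2 = math.exp(log_bessel_i(v + 2.0, kappa) - log_iv)  # I_{m/2+1} / I_{m/2-1}
    r_minus1 = r_plus1 + 2.0 * v / kappa                       # I_{m/2-2} / I_{m/2-1}
    return 0.5 * kappa * (r_plus2 - r_plus1 * (r_minus1 + r_plus1) + 1.0)

kl_value = kl_vmf_uniform(kappa=5.0, m=3)
kl_slope = grad_kl_vmf_uniform(kappa=5.0, m=3)
```

As expected, the KL term tends to zero as κ approaches zero (the posterior becomes uniform) and grows with κ, which is why κ acts as the only regularized parameter of the posterior.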

A. MNIST DATASET
For all experiments, we used the Modified National Institute of Standards and Technology (MNIST) benchmark dataset. There has been extensive research on this dataset for various purposes such as image classification and generation. The dataset consists of 60,000 training images and 10,000 test images, both drawn from the same distribution. All images are of size 28 × 28, and the dataset contains ten label classes, the digits 0 to 9. In all experiments, the Keras framework was used to build all versions and variations of the VAE included in this paper.

B. IMPLEMENTATION DETAILS
All implementations were run on a Tesla P100 GPU with 25 GB of RAM. Table 1 shows the values of the hyper-parameters used in all experiments. Some parameters were set to the same values for all VAE variants to allow a fair comparison on the MNIST dataset. All code is available at https://github.com/VAEs-Tutorial/paper.

1) QUALITY OF THE GENERATED IMAGES
We applied VAE, InfoVAE, β-VAE, VAE-GAN, GMVAE, VaDE, VQ-VAE and S-VAE to the MNIST dataset, training each for 5000 epochs. The generated images are shown in Fig. 17 and Fig. 18. The figures clearly show that GMVAE generated clearer digits than the other VAE models. After GMVAE, β-VAE and VaDE produced the next best digits, although some blurriness is still present in them. β-VAE and InfoVAE tend to find more latent factors consistently and learn disentangled characterizations more clearly than the other VAE models. InfoVAE provides a good way to handle the issue of the latent code being ignored. However, its generated digits are more distorted than those of β-VAE. VaDE and GMVAE are clustering models; they can perform clustering tasks in addition to generating digits better than the other VAE models. Images generated by the GMVAE have better quality than the ones generated by VaDE. VAE-GAN, VQ-VAE and S-VAE produced noisy images in which the digits are not very clear. In addition, some digits generated by VQ-VAE and VAE-GAN do not have clear shapes and edges. Images generated by the VAE-GAN have better quality than the ones generated by VQ-VAE. Increasing the number of epochs can improve the quality of the generated images. However, we compared the results at 5000 epochs only, for a consistent comparison. Table 2 shows quantitative results of all VAE variants on the MNIST dataset. The evaluation metrics are classification accuracy, loss, and computational time (seconds per epoch).

2) QUANTITATIVE RESULTS ON THE MNIST DATASET
We fixed the number of epochs at 5000 for all experiments. Table 1 shows the other hyperparameter values that were fixed for these experiments to allow a fair comparison. We use a pre-trained CNN classifier to calculate the classification accuracy of the regenerated images. Initially, the quality and accuracy of the generated images are very low because the generator knows little about the real data. After a number of epochs, the generator starts to learn and generates more accurate images. As we can see, the highest classification accuracy, 96.3%, is obtained by GMVAE. The computational time of this experiment is also low, at two seconds per epoch, which shows the efficiency of VAE. The lowest classification accuracy, 51.87%, is obtained by VAE-GAN. Figure 19 compares the classification accuracy of VAE and its variants on the MNIST dataset. If we analyze the computational times and loss values in Table 2, there is high fluctuation among them. Compared with other deep learning models, loss values do not tell much about a VAE's performance, and it is not obvious when to stop training a VAE. In a VAE, the KL divergence loss and the reconstruction loss compete with each other, and an improvement in one term means more loss in the other. Depending on the diversity of the dataset, both the KL divergence and the reconstruction loss terms start to converge after a certain number of epochs. When there is no further decrease in the KL divergence and reconstruction terms, training is almost complete.
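The two competing terms mentioned above can be written down explicitly for the common case of a diagonal Gaussian posterior, a standard normal prior and binary pixels. The sketch below is a minimal illustration under those assumptions; the function names are hypothetical, and the weight `beta` (1.0 for a plain VAE) only indicates where a β-VAE style trade-off would enter.

```python
import math

def gaussian_kl(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var))

def bernoulli_reconstruction(x, x_hat, eps=1e-7):
    """Negative log-likelihood of binary pixels x under predicted probabilities x_hat."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def vae_loss(x, x_hat, mean, log_var, beta=1.0):
    """Return (total, reconstruction, kl) so both terms can be monitored separately."""
    recon = bernoulli_reconstruction(x, x_hat)
    kl = gaussian_kl(mean, log_var)
    return recon + beta * kl, recon, kl

# Toy example: two pixels and a two-dimensional latent code.
total, recon, kl = vae_loss(
    x=[1.0, 0.0], x_hat=[0.9, 0.2], mean=[0.5, -0.5], log_var=[0.0, 0.0]
)
```

Tracking `recon` and `kl` separately, rather than only their sum, makes the competition between the two terms visible during training.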
Convergence of the KL divergence and reconstruction loss terms indicates that the model has learned well enough and cannot be improved further. However, the loss value may fluctuate somewhat, and its absolute value is not very informative. If the model does not converge, the learning rate or other hyper-parameters may need to be changed. To date, there is no proper evaluation metric for VAEs. Sometimes good qualitative results are accompanied by lower accuracy. In addition, different VAEs use different loss functions, which makes a fair comparison among them difficult. Therefore, the evaluation metric and training procedure should be chosen according to the desired application or task, e.g., a variety of disentanglement evaluation protocols have been proposed that leverage the statistical relations between the learned representation and the ground-truth factors of variation [95]. Also, a VAE model's good performance in one domain does not necessarily mean good performance in another domain.
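One simple way to turn the convergence criterion described above into a stopping rule is to compare the average of each loss term over the most recent epochs with its average over the preceding epochs. The sketch below is only one possible heuristic; the window size, the tolerance and the function names are assumptions and would need tuning for a real training run.

```python
def has_plateaued(history, window=5, tolerance=1e-3):
    """True when the mean of the last `window` values differs from the mean of the
    previous `window` values by less than `tolerance`."""
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    previous = sum(history[-2 * window:-window]) / window
    return abs(previous - recent) < tolerance

def training_finished(kl_history, recon_history, window=5, tolerance=1e-3):
    """Stop only when both the KL term and the reconstruction term have plateaued."""
    return has_plateaued(kl_history, window, tolerance) and has_plateaued(
        recon_history, window, tolerance
    )

# Hypothetical per-epoch loss curves.
flat_kl = [12.0] * 10
falling_recon = [100.0 - 5.0 * epoch for epoch in range(10)]
stop_now = training_finished(flat_kl, falling_recon)
```

Averaging over a window rather than comparing single epochs makes the rule less sensitive to the fluctuation of the loss values noted above.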

V. COMPARATIVE EVALUATION OF VAES
We have introduced the most significant problems present in the original VAE design, which are: 1) the outputs are blurry, 2) the latent representations are not interpretable, 3) the Gaussian prior is limiting because the learned representations can only be unimodal, which does not allow for more complex features, and 4) the Gaussian definition is based on the L2-norm, which suffers from the curse of dimensionality. We have surveyed significant VAE variants that remedy these [95]. β-VAE models cannot effectively trade off the weighting between X and Z against the information preference, and they can also under-fit or ignore a subset of the latent variables.

C. PRIOR-VARIANT
One way to improve any Bayesian model is to change the prior distribution based on the data. VaDE and GMVAE add clustering by imposing a GMM prior on VAEs. The number of clusters can be set to the number of classes in each dataset. However, both suffer from high computational complexity, and the quality of their generated images is low. Moreover, VaDE has no specified stability for different settings of the number of clusters, K. GMVAE is slightly more complicated than VaDE, and it is computationally expensive because of the need to sample an additional latent space (w).
VQ-VAE proposes combining VAEs with a discrete latent representation by imposing a vector quantization algorithm on the latent space. It can learn useful discrete representations automatically and abstract away noise and details. However, VQ-VAE has a complex sampling procedure and is unstable on challenging (i.e., high-entropy) datasets. S-VAE uses a spherical latent representation by replacing the Gaussian prior of the classical VAE with the von Mises-Fisher (vMF) distribution. It can use the hyperspherical space to separate the clusters of each class without forcing their means to be close to the center.
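The vector quantization step can be illustrated by a nearest-neighbor lookup in a codebook: each continuous encoder output is replaced by the closest code vector, and its index serves as the discrete latent. The sketch below shows only this lookup with hypothetical toy values; learning the codebook and passing gradients through the non-differentiable lookup are not covered.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(latent, codebook):
    """Return (index, code vector) of the codebook entry nearest to the encoder output."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(latent, codebook[k]))
    return index, codebook[index]

# Hypothetical codebook with three 2-D code vectors and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]]
codes = [quantize(z, codebook)[0] for z in encoder_outputs]
```

Because every encoder output is mapped to one of a finite set of vectors, small perturbations of the input that do not change the nearest code are discarded, which is one way to read the statement that the model abstracts away noise and details.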
Among the different approaches surveyed in this work, it was shown that variations of the VAE can improve the quality and diversity of the generated images. It has been indicated in [99] that the capacity and performance of VAEs are related to the network size and batch size, from which it follows that a well-designed architecture is critical for good VAE performance. However, modifications to the architecture alone do not fully improve the generation of data. Redesigning the loss function, including regularization and normalization, can help improve reconstruction for VAEs. In addition, replacing the Gaussian prior can help the VAE model learn appropriate latent representations.
There are other types of VAEs that have been introduced but are not frequently used in applications. One example is the Conditional VAE (CVAE) [96], which is similar to the Conditional GAN: a control vector ''c'' is used as an input together with the data ''x'' and the latent variables ''z'' as part of the VAE structure. In most applications of this type of VAE, the label data is used as the control variable. Moreover, other types of VAE have been introduced in which the KL divergence is not used as the distribution similarity measure and other metrics are used instead. An example is the Wasserstein VAE (WAE) [97], where the Wasserstein distance replaces the KL term in the loss function to measure the similarity between the model distribution and the target distribution. S3VAE [98] learns disentangled time-invariant and time-varying representations for sequential data (e.g., video and audio) under self-supervision. This enables sequential data generation, high-resolution video generation, video prediction and image-to-video generation.
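The role of the control vector in a conditional VAE can be illustrated by how the inputs are assembled: the label, encoded as a one-hot vector c, is appended both to the data x before the encoder and to the latent code z before the decoder. The sketch below shows only this input construction with hypothetical toy values; concatenation is one common way to inject c, not necessarily the one used in [96].

```python
def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def condition(vector, label, num_classes):
    """Append the one-hot control vector c to a data vector x or a latent vector z."""
    return list(vector) + one_hot(label, num_classes)

# Hypothetical flattened input x, latent code z and label.
x = [0.2, 0.7, 0.1]
z = [0.5, -1.3]
encoder_input = condition(x, label=2, num_classes=4)  # [x ; c]
decoder_input = condition(z, label=2, num_classes=4)  # [z ; c]
```

At generation time, fixing c and sampling z from the prior yields samples of the requested class.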
There is no single VAE design that can be claimed to be the best. The choice of a specific VAE type depends on the application. For instance, if an application requires sampling different classes in the latent space, clustering is needed, and VaDE, GMVAE and S-VAE can be good choices. S-VAE can do a better job on directional data than the other two. If an application requires producing enough high-quality images (requiring the generation of very diverse images), VAE-GAN can be a good choice. If an application requires producing enough CNN visual features for few-shot/zero-shot learning (requiring the generation of very diverse features), f-VAEGAN-D2 or Zero-VAE-GAN can be a good choice. If there is a need to learn useful disentangled latent representations automatically, in order to obtain more image attributes and further improve classification, InfoVAE and β-VAE can be good choices. Table 3 summarizes the different variants along with their description and variation type. Table 4 lists the pros and cons of the different VAE designs discussed in this paper. Figure 20 shows the variation types of VAEs.

VI. FUTURE OPPORTUNITIES AND CONCLUSIONS
VAEs and their variations have played a very important role in unsupervised data generation (especially image generation), deep clustering and representation learning.
The improvement in the variations of the VAEs can be summarized into three perspectives: 1) Architecture: By enhancing the network design with other architectures, VAEs can improve the image quality, e.g., VAEs combined with GANs can decrease the blurred effect of the image.
2) Posterior distribution: Regularization of the posterior distribution can be used to promote disentangled features, e.g., β-VAE and InfoVAE provide disentanglement and hierarchical organization of features.
3) Structured prior distribution: VAE variants can also introduce a structured prior distribution, such as a GMM prior (GMVAE, VaDE) or Vector Quantization (VQ) on VAEs (VQ-VAE). These achieve better clustering and representation of data.
Based on the analyses and the comparative evaluation provided in this paper, we believe that understanding the VAE model from the perspective of variational optimization and information theory will be an important research trend in the near future. We summarize some of the potential areas of research in the VAE field as: 1) Enhancing the VAE model by improving the variational optimization of the latent variable space, thereby avoiding or minimizing the limitations of the existing methods. This can make learning more meaningful by providing valuable information in the variational inference process. Many research opportunities remain unexplored at the intersection of these methods, e.g., integrating regularization-based methods with structured priors.
2) Separation of information: By weakening the dependencies between non-associative features, the disentanglement and generation capabilities of VAEs can be greatly improved. By calculating the mutual information between the various influencing factors in VAEs, the model can potentially discriminate influencing factors from non-descriptive ones.
3) Disentanglement learning: It is unclear whether disentangled representations are useful for downstream tasks. Therefore, future research on disentangled representation learning should consider the role of inductive biases and supervision.
4) Posterior collapse: Posterior collapse in VAEs arises when the variational posterior distribution closely matches the prior for a subset of latent variables [100]. Conventional wisdom largely blames this phenomenon on the undue influence of KL-divergence regularization. Although there is now a vast literature on the various potential causes of posterior collapse, there remains ambiguity as to exactly what this phenomenon is [101]. Therefore, more significant progress towards understanding the causes of posterior collapse is needed.
These proposed enhancements will improve the ability to generate meaningful artificial data. This data can be used for representation learning or to improve the classification in deep networks where currently there is not enough training data, or a particular class is underrepresented.
We have provided comprehensive insight into, and a comparative evaluation summary of, the variations in VAEs so that researchers can grasp the fundamental theory as well as the intuition behind them. Further, we have provided reference implementations of the different VAE variations on GitHub. We hope that this will be useful in improving the state of the art, leading to research breakthroughs in related fields.