Evaluation Metrics for Generative Models: An Empirical Study

Abstract: Generative models such as generative adversarial networks, diffusion models, and variational auto-encoders have become prevalent in recent years. Although these models have shown remarkable results, evaluating their performance is challenging. Reliable evaluation is vital for pushing research forward and for distinguishing meaningful gains from random noise. Currently, heuristic metrics such as the inception score (IS) and Fréchet inception distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear, and it is also unclear how meaningful their scores actually are. In this work, we propose a novel evaluation protocol for likelihood-based generative models, based on generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. This scheme exploits knowledge of the underlying likelihood values of the data by measuring the divergence between the model-generated data and the synthetic dataset. Our study shows that while FID and IS correlate with several f-divergences, their ranking of close models can vary considerably, making them problematic when used for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics.


Introduction
Implicit generative models such as generative adversarial networks (GANs) [1] have made significant progress in recent years, and are capable of generating high-quality images [2,3] and audio [4]. Despite these successes, evaluation is still a major challenge for implicit models that do not predict likelihood values. While significant improvement can easily be observed visually, at least for images, an empirical measure is required as an objective criterion and for comparison between relatively similar models. Moreover, devising objective criteria is vital for development, where one must choose between several design choices, hyper-parameters, etc. In light of the epistemological challenges presented in [5], a study on the limits of knowledge, it becomes imperative to explore how these constraints affect the evaluation and understanding of generative models. The most common practice is to use metrics such as the inception score (IS) [6] and Fréchet inception distance (FID) [7] that are based on features and scores computed using a network pre-trained on the ImageNet [8] dataset. While these have proved to be valuable tools, they have some key limitations: (i) It is unclear how they relate to any classical metrics on probabilistic spaces. (ii) These metrics are based on features and classification scores trained on a certain dataset and image size, and it is not clear how well they transfer to other image types, e.g., human faces, and image sizes. (iii) The scores can heavily depend on particular implementation details [9,10].
Another evaluation tool is querying humans. One can ask multiple human annotators to classify an image as real or fake or to state which of two images they prefer. While this metric directly measures what we commonly care about in most applications, it requires a costly and time-consuming evaluation phase. Another issue with this metric is that it does not measure diversity, as returning a single good output can obtain a good score.
In this article, we offer a new evaluation protocol for likelihood-based generative models such as autoregressive (AR) models and variational auto-encoders (VAEs) [11]. We created a high-quality synthetic dataset using the powerful Image-GPT model [12]. This is a complex synthetic data distribution from which we can sample and for which we can compute exact likelihood values. As this data distribution is trained on natural images from the ImageNet dataset using a strong model, we expect the findings on it to be relevant to models trained on real images. The dataset provides a solid and useful test-bed for developing and experimenting with generative models. We will make our dataset public for further research (https://github.com/eyalbetzalel/notimagenet32, accessed on 25 May 2024).
Using this test-bed, we train various likelihood models and evaluate their KL-divergence and reverse KL-divergence. While our interest is implicit models, we experiment with likelihood models as they have alternative well-understood metrics for comparison. This allows us to compare the well-understood divergences to empirical metrics such as FID and evaluate their capabilities. We expect our results to transfer to implicit models, as metrics such as FID and IS are not tailored to a specific kind of model. We observe that while the empirical metrics correlate well with these divergences, they are much more volatile, and thus might not be well suited for fine-grained comparison.
To better structure our investigation, and to clarify the scope of this study, we have delineated specific research questions and compiled the key findings that emerged from our experimental work. These elements are summarized below, highlighting both the focus of our research and the implications of our results:
Research questions:
1. How do empirical metrics like the inception score (IS) and Fréchet inception distance (FID) compare with probabilistic f-divergences such as KL and RKL in evaluating generative models?
2. What limitations exist in using popular metrics like IS and FID for model evaluation across diverse datasets and model types?
3. Can a synthetic dataset provide a controlled environment to better evaluate and understand these metrics?

Key findings:
1. Empirical metrics, while commonly used, exhibit considerable volatility and do not always align with f-divergence measures.
2. Inception features, although useful, show limitations when applied outside of the ImageNet dataset, impacting the reliability of IS and FID.
3. The introduction of a high-quality synthetic dataset, NotImageNet32, helps in evaluating these metrics more consistently, offering a new pathway for robust generative model assessment.

Background
Given the popularity of GANs and other implicit generative models, many heuristic evaluation metrics have been proposed in recent years. We give a quick overview of the most common metrics and probabilistic KL-divergences.

KL-Divergence
One common measure of the difference between probability distributions is the Kullback-Leibler (KL) divergence, $\mathrm{KL}(p\,\|\,q) = \mathbb{E}_{x\sim p}\left[\log \frac{p(x)}{q(x)}\right]$; note that it is not symmetric. We refer to $\mathrm{KL}(p_{\text{data}}\,\|\,p_{\text{model}})$ as the KL-divergence and $\mathrm{KL}(p_{\text{model}}\,\|\,p_{\text{data}})$ as the reverse KL (RKL)-divergence, where $p_{\text{data}}$ denotes the real data distribution and $p_{\text{model}}$ denotes the approximated distribution learned by the generative model. Maximizing the log-likelihood is the same as minimizing the KL-divergence between $p_{\text{data}}$ and $p_{\text{model}}$ up to a constant, hence it can be performed even when $p_{\text{data}}$ is unknown. It is important to note that the KL-divergence is biased towards "inclusive" models, where the model "covers" all high-likelihood areas of the data distribution, and punishes harder when $p_{\text{data}}(x) \gg p_{\text{model}}(x)$ (Figure 1, left). The RKL has a bias toward "exclusive" models, where the model does not cover low-likelihood areas of the data distribution, and punishes harder when $p_{\text{data}}(x) \ll p_{\text{model}}(x)$ (Figure 1, right). While an exclusive bias might be more appropriate in some applications, such as out-of-distribution detection, we cannot optimize it directly without access to $p_{\text{data}}$. As these divergences measure complementary aspects, we believe that examining both of them simultaneously gives us a well-rounded view of the generative model behavior. A limitation of the KL-divergence is that it does not consider the metric properties of the sample space, as opposed to the Wasserstein distance; therefore, it is less suitable for GAN training, since GAN training uses samples directly in the training process [13].
Figure 1. Optimizing $p_{\text{model}}$ with the KL criterion pushes the model to cover all aspects of $p_{\text{data}}$, hence it is more inclusive, while optimizing it with the reverse KL criterion encourages the model to cover the area with the largest probability, hence it is more exclusive.
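The inclusive/exclusive asymmetry can be illustrated on a small discrete example. The following is a minimal sketch; the distributions `p_data`, `q_inclusive`, and `q_exclusive` are hypothetical and chosen only for illustration, not taken from our experiments.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bimodal data distribution over four outcomes.
p_data = [0.49, 0.01, 0.01, 0.49]
# "Inclusive" candidate: covers both modes by spreading its mass everywhere.
q_inclusive = [0.25, 0.25, 0.25, 0.25]
# "Exclusive" candidate: concentrates on one mode and nearly ignores the other.
q_exclusive = [0.97, 0.01, 0.01, 0.01]

for name, q in [("inclusive", q_inclusive), ("exclusive", q_exclusive)]:
    # KL samples from p_data, RKL samples from the model q.
    print(f"{name}: KL={kl(p_data, q):.3f}  RKL={kl(q, p_data):.3f}")
```

In this toy case the KL-divergence ranks the inclusive candidate first, whereas the RKL ranks the exclusive candidate first, which is why we report both.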

Inception Score
The inception score (IS) is a metric for evaluating the quality of image generative models based on the InceptionV3 network pre-trained on ImageNet. It calculates $\mathrm{IS} = \exp\left(\mathbb{E}_{x\sim p_{\text{model}}}\left[\mathrm{KL}\left(p_\theta(y|x)\,\|\,p_\theta(y)\right)\right]\right)$, where $x \sim p_{\text{model}}$ is a generated image, $p_\theta(y|x)$ is the conditional class distribution computed via the inception network, and $p_\theta(y) = \int_x p_\theta(y|x)\,p_{\text{model}}(x)\,dx$ is the marginal class distribution. The two desired qualities that this metric aims to capture are: (i) The generative model should output a diverse set of images from all the different classes in ImageNet, i.e., $p_\theta(y)$ should be uniform. (ii) The images generated should contain clear objects, so the predicted probabilities $p_\theta(y|x)$ should be close to a one-hot vector and have low entropy. When both of these qualities are satisfied, the KL-divergence between $p_\theta(y|x)$ and $p_\theta(y)$ is maximized. Therefore, the higher the IS, the better.
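As an illustration of the definition above, the following minimal sketch computes the IS from a matrix of class probabilities. It assumes the rows of `probs` are the outputs $p_\theta(y|x)$ of some classifier; the function name and the toy inputs are hypothetical, and no inception network is involved.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (N, C) array whose rows are class distributions p(y|x)."""
    probs = np.asarray(probs, dtype=np.float64)
    # Monte Carlo estimate of the marginal class distribution p(y).
    p_y = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for every sample; eps guards against log(0).
    kl_per_sample = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl_per_sample.mean()))

# Confident and diverse predictions: the score approaches the number of classes.
sharp_and_diverse = np.eye(10)
# Uninformative predictions: the score is 1, the minimum.
uniform = np.full((10, 10), 0.1)
print(inception_score(sharp_and_diverse), inception_score(uniform))
```

The two toy inputs mark the extremes of the score for ten classes: one-hot predictions spread evenly over all classes versus identical uniform predictions.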

Fréchet Inception Distance
The FID metric is based on the assumption that the features computed by a pre-trained inception network, for both real and generated images, have a Gaussian distribution. We can then use known metrics for Gaussians as our distance metric. Specifically, FID uses the Fréchet distance between two multivariate Gaussians, which has a closed-form formula. For both real and generated images, we fit Gaussian distributions to the features extracted by the inception network at the pool3 layer and compute $\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$, where $\mathcal{N}(\mu_r, \Sigma_r)$ and $\mathcal{N}(\mu_g, \Sigma_g)$ are the Gaussians fitted to the real and generated data, respectively. The quality of this metric depends on the features returned by the inception net, how informative they are about the image quality, and how reasonable the Gaussian assumption about them is.
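The closed-form Fréchet distance can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the feature arrays are random stand-ins rather than inception pool3 features, the function names are hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of $\Sigma_r \Sigma_g$ instead of an explicit matrix square root.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Tr((S_r S_g)^{1/2}) is the sum of the square roots of the eigenvalues of
    # S_r S_g, which are real and non-negative for positive semi-definite inputs.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def fid_from_features(feat_r, feat_g):
    """Fit a Gaussian to each (N, D) feature array and compare the two fits."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    sigma_r = np.cov(feat_r, rowvar=False)
    sigma_g = np.cov(feat_g, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

# Stand-in features: the "generated" set is the "real" set shifted by 2 per dimension.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
print(fid_from_features(real, real + 2.0))
```

Because the shifted copy has the same covariance, only the mean term contributes in this example.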

Kernel Inception Distance
The kernel inception distance (KID) [14] aims to improve on FID by relaxing the Gaussian assumption. KID measures the squared maximum mean discrepancy (MMD) between the inception representations of the real and generated samples using a polynomial kernel. This is a non-parametric test, so it does not have the strict Gaussian assumption, only assuming that the kernel is a good similarity measure. It also requires fewer samples, as we do not need to fit the quadratic covariance matrix. The motivation for this is the bias of the FID and IS.
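A minimal sketch of the squared MMD estimate follows. The text above only specifies a polynomial kernel; the cubic kernel $k(x, y) = (x^\top y / d + 1)^3$ used here is an assumption (a commonly used choice), the unbiased estimator is one of several possible estimators, and the feature arrays are random stand-ins rather than inception representations.

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """k(x, y) = (x.y / d + coef0) ** degree for all pairs of rows (assumed form)."""
    d = x.shape[1]
    return (x @ y.T / d + coef0) ** degree

def mmd2_unbiased(x, y):
    """Unbiased estimate of the squared MMD between two (N, D) sample sets."""
    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = polynomial_kernel(x, x), polynomial_kernel(y, y), polynomial_kernel(x, y)
    # Drop the diagonal terms k(x_i, x_i), which would bias the within-set averages.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
same = rng.normal(size=(200, 8))            # drawn from the same distribution
shifted = rng.normal(loc=1.0, size=(200, 8))  # drawn from a shifted distribution
print(mmd2_unbiased(real, same), mmd2_unbiased(real, shifted))
```

The estimate for two sets from the same distribution fluctuates around zero and can be slightly negative, which is expected for an unbiased estimator.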

$\mathrm{FID}_\infty$, $\mathrm{IS}_\infty$, and Clean FID
In [15], the authors show that the FID and IS metrics are biased when they are estimated from samples and that this bias depends on the model. As the bias is model-dependent, it can skew the comparison between different models. The authors then propose unbiased versions of FID and IS named $\mathrm{FID}_\infty$ and $\mathrm{IS}_\infty$. As the input to the inception network is fixed-size, generated images of different sizes need to be resized to fit the network's desired input dimension. The work in [16] investigates the effect of this resizing on the FID score, as the resizing can cause aliasing artifacts. The lack of consistency in the processing method can lead to different FID scores, regardless of the generative model capabilities. They introduce a unified process that has the best performance in terms of image processing quality and provide a public framework for evaluation.
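The procedure behind the unbiased variants is not spelled out here. To our understanding, [15] evaluates the score at several sample sizes $N$, fits a linear trend in $1/N$, and reports the intercept as the value at $N \to \infty$; the sketch below follows that reading and should be treated as an assumption. The function name and the synthetic scores are hypothetical.

```python
import numpy as np

def extrapolate_to_infinite_samples(sample_sizes, scores):
    """Fit score = slope * (1 / N) + intercept and return the intercept."""
    inv_n = 1.0 / np.asarray(sample_sizes, dtype=np.float64)
    # np.polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(inv_n, np.asarray(scores, dtype=np.float64), deg=1)
    return float(intercept)

# Synthetic scores with a bias term that decays as 1/N (illustrative values only).
sizes = [1000, 2000, 5000, 10000]
scores = [5.0 + 300.0 / n for n in sizes]
print(extrapolate_to_infinite_samples(sizes, scores))
```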

Ranking Correlation Methods
To compare the different scoring methods, we evaluate how they differ in ranking different models. This focuses the comparison on the main purpose of these metrics, which is ranking models. For this, we use ranking correlation metrics.

Spearman Correlation
The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables. For $n$ elements being ranked, the raw scores $X_i, Y_i$ are converted to ranks $R(X_i), R(Y_i)$. The Spearman correlation coefficient $r_s$ is defined as $r_s = \rho_{R(X),R(Y)} = \frac{\mathrm{cov}(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}}$, where $\rho$ denotes the usual Pearson correlation coefficient, but applied to the rank variables, $\mathrm{cov}(R(X), R(Y))$ is the covariance of the rank variables, and $\sigma_{R(X)}$ and $\sigma_{R(Y)}$ are the standard deviations of the rank variables.
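The definition translates directly into code: rank both score lists, then compute the Pearson correlation of the ranks. The sketch below is illustrative; the helper names are hypothetical, and tied values are given the average of their ranks, which is one common convention.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Pearson correlation between the rank variables R(X) and R(Y)."""
    rx, ry = average_ranks(x), average_ranks(y)
    cov = np.mean((rx - rx.mean()) * (ry - ry.mean()))
    return float(cov / (rx.std() * ry.std()))

# Any strictly increasing relation gives a coefficient of 1, linear or not.
x = [0.1, 0.5, 0.9, 1.3]
print(spearman(x, [v ** 3 for v in x]))
```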

Kendall's τ
Kendall's τ [17] correlation coefficient assesses the strength of association between pairs of observations based on the patterns of concordance and discordance between them. A pair of observations $(x_1, y_1)$ and $(x_2, y_2)$ has a consistent order (concordant) when $x_2 - x_1$ and $y_2 - y_1$ have the same sign, and an inconsistent order (discordant) when they have opposite signs. Kendall's τ is defined as $\tau = \frac{C - DC}{C + DC}$, where $C$ is the number of concordant pairs in the list and $DC$ is the number of discordant pairs.
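Counting concordant and discordant pairs can be sketched as follows. This is a minimal illustration that normalizes by the number of non-tied pairs, $C + DC$, and simply skips tied pairs; the function name and the example scores are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(C - DC) / (C + DC), skipping pairs tied in either variable."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    # Note: raises ZeroDivisionError if every pair is tied.
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical scores of four models under two metrics; one adjacent pair is swapped.
metric_a = [1.0, 2.0, 3.0, 4.0]
metric_b = [1.0, 3.0, 2.0, 4.0]
print(kendall_tau(metric_a, metric_b))
```

With four models there are six pairs; one swapped pair leaves five concordant and one discordant.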

Related Works
In addition to previously mentioned works that defined empirical metrics, other works looked into the evaluation of generative models. Bond-Taylor et al. [18] performed a comparative review of deep generative models. Borji [19] provides an extensive overview of methods for estimating generative models. Theis et al. [20] examine likelihood-based models and demonstrate through toy examples the independence of various evaluation methods. We endorse this view and conduct an in-depth empirical analysis using real datasets to compare contemporary evaluation techniques for generative models. Ref. [9] first pointed out issues in IS. Ref. [21] inspects the distribution of the inception latent feature and suggests a more accurate model for evaluation purposes. Ref. [22] performs an empirical study on an older class of evaluation metrics of GANs and mentions that KID outperforms FID and IS. Ref. [23] shows IS's high sensitivity to the dataset trained by the backbone network (in this example, ImageNet and CIFAR-10). Ref. [24] shows FID's sensitivity to layers and features of the backbone network and mode dropping.
Another line of work [25-28] utilizes the classification score of generated data to evaluate models' performance. Despite its usefulness, a classification score is not foolproof. During adversarial attacks, for example, the image may appear perfect, but its classification score will be poor.
The latest works propose precision and recall as a way to disentangle the quality of generated samples from the coverage of the target distribution [29,30].

Method
As the first step of our method, we train an autoregressive model to approximate the data distribution. Using the model, whose distribution we know, we create a high-quality synthetic dataset, and then examine the performance of other likelihood-based models against the synthetic data. Algorithm 1 and Figure 2 summarize the steps involved in the method.
In this article, we created an auxiliary realistic dataset by sampling images from the Image-GPT model that has been trained on ImageNet32: the ImageNet dataset that was resized to 32 × 32. Image-GPT was chosen as a reference for being a powerful AR model with 1 M epochs of training checkpoints available (https://github.com/openai/image-gpt, accessed on 25 May 2024). We split the dataset into a training set (70 K images) and a test set (30 K images), similar in size to CIFAR10, a common benchmark. Image-GPT's ability to generate quality and realistic samples is demonstrated qualitatively in Figure 3 and quantitatively by its high linear probing scores. As this is a synthetic version of ImageNet32, we name our dataset NotImageNet32. We note that Image-GPT clusters the RGB values of each pixel into 512 clusters and predicts these cluster indexes. This means that instead of each pixel corresponding to an element of $\{0, \ldots, 255\}^3$, it belongs to $\{0, \ldots, 511\}$. We can map these cluster values back to RGB, as in Image-GPT, for visualization.
This scheme is not restricted to NotImageNet32, which is used as an example for a single use case. In general, we advocate for using high-quality synthetic datasets to bridge the gap between real data, on which performance is hard to evaluate, and toy problems that do not necessarily represent real challenges. This can be utilized for ranking state-of-the-art (SOTA) generative models and finding hyper-parameters of the data generation process such that they produce the least amount of inconsistencies across measurements.
To evaluate and understand current heuristic generative model metrics, we train a set of models on NotImageNet32. One set of models is based on the PixelSnail model [31]. We use PixelSnail as it is a strong autoregressive model, but not as powerful as the Image-GPT that generated the data. From this, we expect it to be able to fit the data well, but not perfectly. For diversity, we also measure a VAE model, based on VD-VAE [32] (we use IWAE [33] to reduce the gap between the ELBO and the actual likelihood). We note that all models were adjusted to our dataset and output the clustered index instead of RGB values. Supplementary details on the model architectures in this experiment can be found in Appendixes A-D.
To produce a diverse set of models with varying degrees of quality, each set was trained several times with different model sizes. We save a model for comparison after every five epochs of training. As a result, the models we compare are a mix of strong and weak models. After the training procedure, we can compute for each image in the test set its likelihood score (or the IWAE bound) for each model.
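For the VAE models, the IWAE bound replaces the exact log-likelihood. A minimal sketch of the bound for one image is given below, assuming we already have $K$ log importance weights $\log w_k = \log p(x, z_k) - \log q(z_k | x)$ with $z_k \sim q(z | x)$; the function name and the example weights are hypothetical.

```python
import numpy as np

def iwae_bound(log_weights):
    """log((1/K) * sum_k w_k), computed stably from K log importance weights."""
    log_weights = np.asarray(log_weights, dtype=np.float64)
    # Subtract the maximum before exponentiating to avoid underflow.
    m = log_weights.max()
    return float(m + np.log(np.mean(np.exp(log_weights - m))))

# Illustrative log-weights for a single image.
log_w = [-1050.2, -1047.9, -1053.4, -1049.1]
print(iwae_bound(log_w), np.mean(log_w))
```

By Jensen's inequality, the log of the averaged weights is never below the average of the log-weights, which is the corresponding single-sample ELBO estimate.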
We then measure the difference between $p_{\text{data}}(x)$ and $p_{\text{model}}(x)$ by using a Monte Carlo approximation of two divergence functions: the Kullback-Leibler (KL) divergence $\mathrm{KL}(p_{\text{data}}\,\|\,p_{\text{model}})$ and the reverse KL (RKL) divergence $\mathrm{KL}(p_{\text{model}}\,\|\,p_{\text{data}})$. As these divergences measure complementary aspects, one inclusive and one exclusive, we believe that this, although unable to capture all the complexities of a generative model, gives us a well-rounded view of the generative model behavior. The KL-divergence has been thoroughly investigated in the fields of probability and information theory, and its properties along with what it measures are well known. Thus, comparing it to heuristic methods such as FID will shed light on these empirical methods.
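The Monte Carlo approximation amounts to averaging log-likelihood differences over samples drawn from the first argument of the divergence. The sketch below illustrates this on one-dimensional Gaussians, where the exact value (0.5 in both directions) is known; the Gaussians stand in for Image-GPT and the evaluated model, and all names are hypothetical.

```python
import numpy as np

def mc_kl(logp_sampler, logp_other):
    """Estimate KL(a || b) from log a(x_i) and log b(x_i) on samples x_i ~ a."""
    return float(np.mean(np.asarray(logp_sampler) - np.asarray(logp_other)))

def gaussian_logpdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

rng = np.random.default_rng(0)
# KL(p_data || p_model): evaluate both likelihoods on samples from p_data.
x_data = rng.normal(0.0, 1.0, size=200_000)
kl_est = mc_kl(gaussian_logpdf(x_data, 0.0, 1.0), gaussian_logpdf(x_data, 1.0, 1.0))
# RKL = KL(p_model || p_data): evaluate both likelihoods on samples from p_model.
x_model = rng.normal(1.0, 1.0, size=200_000)
rkl_est = mc_kl(gaussian_logpdf(x_model, 1.0, 1.0), gaussian_logpdf(x_model, 0.0, 1.0))
print(kl_est, rkl_est)
```

The KL estimate needs samples from the data distribution, while the RKL estimate needs samples from the model together with the data likelihood of those samples, which is what the synthetic test-bed makes available.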
A limitation of this test-bed is that it can be applied only to likelihood-based models, so implicit models such as GANs are not able to take advantage of it.

Volatility
We first train four PixelSnail variants on our NotImageNet32 dataset and plot the KL, RKL, FID, and IS (we plot the negative IS, so lower is better for all metrics) over the course of training, measured on the test set, in Figures 4 and 5. It can easily be seen that after 15-20 epochs both KL and RKL change slowly, but the FID and IS are much more volatile. Each dot in the graph represents a score that has been measured on a different epoch on a different model. To assess the variance in the results, we used the jackknife resampling method [34]. The error bars are small ($10^{-3}$ scale in most cases), hence they are unnoticeable. One can see from this figure that as we increase the model capacity, the KL score improves. Model-generated samples are included in Appendix D. Interestingly, the KL and RKL have a high agreement, even though they penalize very different mistakes in the model. In stark contrast, we see that the FID, and especially IS, are much more volatile and can give very different scores to models that have very similar KL and RKL scores. For another perspective, we plot in Figure 6 the FID and negative IS vs. KL and RKL. We observe a high correlation between FID/IS and KL and a weaker correlation between these metrics and the RKL. IS and FID also seem ill-suited for fine-grained comparisons between models. For high-quality models, e.g., light-blue dots in Figure 6, one can obtain a significant change in FID/IS without a significant change to KL/RKL. This can be very problematic, as when comparing similar models, e.g., testing various design choices, these metrics can imply significant improvement even when it is not seen in our probabilistic metrics. We add zoomed-in versions of this plot to Appendix A for greater clarity.
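For reference, a generic leave-one-out jackknife estimate of the standard error can be sketched as follows. This illustrates the resampling technique in general; how it was applied to each metric in our experiments is not detailed here, and the function name and stand-in data are hypothetical.

```python
import numpy as np

def jackknife_std_error(samples, statistic=np.mean):
    """Leave-one-out jackknife estimate of the standard error of `statistic`."""
    samples = np.asarray(samples, dtype=np.float64)
    n = len(samples)
    # Recompute the statistic n times, each time leaving one sample out.
    loo = np.array([statistic(np.delete(samples, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2)))

# Stand-in for per-image log-likelihood differences.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.3, scale=1.0, size=1000)
print(jackknife_std_error(values))
```

For the sample mean, this estimate coincides with the usual standard error, i.e., the sample standard deviation divided by $\sqrt{n}$.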

Ranking Correlation
To better quantitatively assess our previous observations, we compare how the metrics differ in their ranking of the various trained models. This is of great importance, as comparing different models is the primary goal of these metrics. To compare the ranking, we compute Kendall's τ ranking correlation (Table 1). We perform the correlation analysis for models that were trained for 15-45 epochs and ignore the first iterations of the training procedure. This is to focus more on the fine-grained comparisons. The highest score in both ranking correlation methods is between KL and reverse KL with 0.889 Kendall's τ (in bold). This may be surprising since these two methods measure different characteristics of the data. Confirming our previous observation, the FID and IS ranking scores are low, with FID outperforming IS. However, the extensions of FID do achieve better scores.
Another observation is the relatively low correlation between many of the different rankings.All of the inception ranking correlations, except one (KID and clean FID), indicate that one can obtain significantly different rankings by using a different metric.
Among the inception-based metrics, $\mathrm{FID}_\infty$ has the highest correlation with KL and RKL, which indicates that it is a more reliable metric than the others. IS/$\mathrm{IS}_\infty$ has the lowest ranking correlation of all the metrics.

Discussion
This study contributes to the evolving field of generative model evaluation by introducing a novel evaluation protocol that utilizes a high-quality synthetic dataset, NotImageNet32, to compare probabilistic f-divergences like KL and RKL with empirical metrics such as FID and IS. Our findings indicate that while empirical metrics like FID and IS are widely used and correlate with some aspects of model performance, they exhibit considerable volatility and do not always align with changes observed in f-divergence metrics. This discrepancy underscores the complexities and potential limitations of using single metrics for model evaluation.

Comparison with Existing Literature
Our results align with previous studies that have critiqued the reliability of popular metrics like IS and FID, particularly in terms of their consistency and ability to generalize across different datasets and model types. For instance, the use of inception features has been shown to perform variably across non-ImageNet benchmarks, suggesting a need for more versatile and robust evaluation tools. Our study extends this narrative by demonstrating similar volatility and recommending the adoption of newer metrics like $\mathrm{FID}_\infty$ and the exploration of multiple metrics to provide a more comprehensive evaluation.

Implications of Findings
The observed volatility in empirical metrics, especially in high-stakes areas like generative model deployment in medical imaging or autonomous driving, could lead to misguided conclusions about model performance. By advocating for a combination of metrics and the introduction of a synthetic dataset as a standardized test-bed, our study proposes a pathway towards more reliable and interpretable evaluations. This approach could help mitigate risks associated with deploying under-evaluated or overestimated models in critical applications.

Limitations
The primary limitation of this study is its reliance on a single synthetic dataset, NotImageNet32, which, while providing a controlled environment for model evaluation, may not capture the diversity and complexity of real-world data. Additionally, our conclusions are based on the performance of likelihood-based generative models, which may not directly translate to implicit models such as GANs and diffusion models.

Future Research Directions
Future studies should aim to replicate and expand upon our findings by incorporating multiple synthetic and real-world datasets to assess the generalizability of the proposed metrics. Further research should also explore the development and validation of new metrics that can capture a broader range of model behaviors and better reflect real-world performance. Additionally, exploring the integration of human perceptual studies could provide a complementary perspective to purely computational metrics, offering a holistic view of model effectiveness.

Conclusions
We generated a high-quality synthetic dataset and compared standard empirical metrics, such as FID and IS, with probabilistic f-divergences like KL and RKL. Our analysis shows that although the empirical metrics generally correlate well and capture important trends, they demonstrate considerable volatility. Not all observed improvements in these metrics correspond to similar gains in KL- or RKL-divergences. Additionally, the inception score and its $\mathrm{IS}_\infty$ extension tended to perform less effectively compared to other metrics.
Given the outcomes of our study and acknowledging that our analysis is based on a single synthetic dataset, we suggest the following cautious approaches for future research and application:

• Consider phasing out the inception score, favoring $\mathrm{FID}_\infty$ for its reduced volatility.
• Employ a combination of metrics (such as $\mathrm{FID}_\infty$, KID, and clean FID) to help manage metric volatility and provide a more robust evaluation.
• Explore using NotImageNet32 as a potential test-bed for likelihood-based generative models to further assess its efficacy across various generative modeling scenarios.
the residual blocks and attention blocks mentioned earlier. We used the Adam optimizer with a learning rate (LR) of 0.0001 and a multiplicative LR scheduler with a lambda of 0.999977. The loss function was changed to the mean cross-entropy over the 512 discrete clusters. All the other parameters that make up the model are described in Table A1.

Appendix B.2. VD-VAE
The VD-VAE network is built from an encoder and decoder. In the encoder, there are regular blocks, which receive an input and output an output with the same dimension, and down-rate blocks that receive input and output an output with a lower dimension. The difference between these two blocks is an avg_pool2d at the end of the down-rate block. In the decoder, there are regular blocks and mixin blocks. The regular blocks receive an input and output an output with the same dimension. The input is fed from the previous layer and the parallel layer in the encoder. The mixin block performs interpolation to a higher dimension. In Table A2, × means how many regular blocks are concatenated in a row. For example, 32 × 10 means 10 blocks in a row with a 32-channel input. d means a down-rate block. The following number is the factor of the pooling. m means an unpool (mixin) block, for example, 32m16 means 32 is the output dimensionality with 16 layers in the mixin block.
Other hyper-parameters that were changed include the EMA rate, to 0.999, warm-up iterations, to 1, learning rate, to 0.00005, grad clip, to 200, and skip threshold, to 300. We used the Adam optimizer with β 1 = 0.9 and β 2 = 0.9. Other hyper-parameters were configured as mentioned in the VD-VAE article.
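The block notation can be read mechanically. The sketch below parses a comma-separated specification into a flat list of blocks; the comma-separated layout, the function name, and the tuple format are assumptions made for illustration and are not taken from the VD-VAE code base.

```python
import re

def parse_blocks(spec):
    """Expand a spec such as '32x2,32d2,16m8' into (kind, size, argument) tuples."""
    blocks = []
    for token in spec.split(","):
        match = re.fullmatch(r"(\d+)\s*([x×dm])\s*(\d+)", token.strip())
        if match is None:
            raise ValueError(f"unrecognized block token: {token!r}")
        size, kind, arg = int(match.group(1)), match.group(2), int(match.group(3))
        if kind in ("x", "×"):
            # 'size x arg': arg regular blocks in a row.
            blocks.extend([("regular", size, None)] * arg)
        elif kind == "d":
            # 'size d arg': one down-rate block with pooling factor arg.
            blocks.append(("down", size, arg))
        else:
            # 'size m arg': one unpool (mixin) block; arg is kept as written.
            blocks.append(("mixin", size, arg))
    return blocks

print(parse_blocks("32x2,32d2,16m8"))
```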
Algorithm 1 Creating Synthetic Dataset With Known Likelihood
1: Train likelihood-based generative model 1 on dataset $X$
2: Generate $\hat{X}$, $N$ samples from $p_{\text{data}}(x)$ with known likelihood
3: Split $\hat{X}$ into a train set and a test set
4: Train likelihood-based generative model 2 with the train set
5: Evaluate $p_{\text{model}}(\hat{X})$ on the test set from model 2
6: Measure $\mathrm{KL}(p_{\text{data}}(\hat{X})\,\|\,p_{\text{model}}(\hat{X}))$ and $\mathrm{KL}(p_{\text{model}}(\hat{X})\,\|\,p_{\text{data}}(\hat{X}))$ on the test set
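The steps of Algorithm 1 can be walked through on a toy problem. In the sketch below, a fixed categorical distribution stands in for the trained model 1 (Image-GPT in our setting) and a Laplace-smoothed histogram stands in for model 2; both stand-ins and all variable names are hypothetical and serve only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: "model 1" is a known categorical distribution; sample N points from it.
p_data = np.array([0.4, 0.3, 0.2, 0.1])
samples = rng.choice(len(p_data), size=10_000, p=p_data)

# Step 3: split the synthetic samples into a train set and a test set (70/30).
train, test = samples[:7_000], samples[7_000:]

# Step 4: fit "model 2" on the train set (Laplace-smoothed histogram).
counts = np.bincount(train, minlength=len(p_data)) + 1.0
p_model = counts / counts.sum()

# Steps 5-6: Monte Carlo KL on the test set, and RKL on samples drawn from model 2.
kl_hat = float(np.mean(np.log(p_data[test]) - np.log(p_model[test])))
model_samples = rng.choice(len(p_model), size=len(test), p=p_model)
rkl_hat = float(np.mean(np.log(p_model[model_samples]) - np.log(p_data[model_samples])))
print(kl_hat, rkl_hat)
```

Since model 2 fits this toy distribution almost perfectly, both estimates are close to zero and may be slightly negative because of Monte Carlo noise.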

Figure 2. Illustration: $X$ are ImageNet images; $\hat{X}$ are synthetic images sampled from Image-GPT; $p_{\text{data}}(\hat{X})$ is the ground truth likelihood from Image-GPT for synthetic images; and $p_{\text{model}}(\hat{X})$ is the likelihood estimation of $p_{\text{data}}(\hat{X})$, calculated by the evaluated model, in this case, PixelSnail.

Figure 3. Examples of photos that are generated by Image-GPT. Each photo's explicit likelihood can be measured.

Figure 4. Test of KL and RKL of PixelSnail models through training.

Figure 5. Test of FID and negative IS of PixelSnail models through training. We plot the negative inception score, so lower is better for all metrics. Details on the hyper-parameters summarized in the legend are in Appendix B.

Figure 6. Evaluation metrics through the training of four PixelSnail and two VD-VAE models of varying sizes. (a) FID vs. KL, (b) -IS vs. KL, (c) FID vs. RKL, (d) -IS vs. RKL. We plot the negative inception score, so lower is better for all metrics.

Figure A2. NLL score on the training set for different PixelSnail models on NotImageNet32.

Table 1. Kendall's τ correlation between different metrics. A correlation score indicates the degree of agreement between two scoring methods.