A survey on GANs for computer vision: Recent research, analysis and taxonomy

In the last few years, there have been several revolutions in the field of deep learning, mainly headlined by the large impact of Generative Adversarial Networks (GANs). GANs not only provide a unique architecture when defining their models, but also generate remarkable results which have had a direct impact on society. Due to the significant improvements and new areas of research that GANs have brought, the community constantly produces new research that makes it almost impossible to keep up to date. Our survey aims to provide a general overview of GANs, showing the latest architectures, optimizations of the loss functions, validation metrics and application areas of the most widely recognized variants. The efficiency of the different variants of the model architecture will be evaluated, as well as their best application areas; as a vital part of the process, the different metrics for evaluating the performance of GANs and the frequently used loss functions will be analyzed. The final objective of this survey is to summarize the evolution and performance of the best-performing GANs in order to guide future researchers in the field.


Introduction
Generative Adversarial Networks (GANs) are a specific Artificial Neural Network (ANN) architecture introduced in 2014 by Ian Goodfellow [51]. GANs are a type of generative model based on game theory in which ANNs are used to mimic a data distribution. Since they were first introduced, GANs have represented a major change in the quality of data synthesized by Artificial Intelligence (AI).
Due to their success, the amount of GAN-related research has increased exponentially [29]. These works have focused on different aspects of the models, from optimizing their training [77,55] to applying GANs to new fields such as language generation [190], image generation [79,77], image-to-image translation [211,67], text-to-image generation [212], video generation [98], and other domains [80], achieving state-of-the-art results.
GAN models are capable of replicating a data distribution and generating synthesized data, introducing a controlled amount of variation to create new, never-before-seen samples. Due to the particularities of GANs, one of the fields where they have driven a change in the quality of synthesized data is computer vision. Although there were previous generative models [1,8,173], GANs have been shown to generate sharper results [161].
The main peculiarity of GANs lies in their training, which is based on game theory: two neural networks compete in a min-max game. Both networks must optimize their corresponding objective functions, generating a situation where two players compete for opposing objectives. Fig. 1 shows how the GAN architecture is composed. Due to this architectural complexity, GANs suffer from instability during their training [185,169,116].
The instability of training in these models gives rise to problems such as mode collapse, so research has been conducted to tackle this kind of problem [16,7,2,12,42]. As [168] defines it, mode collapse happens when the GAN model generates outputs of the same class for different inputs.
Because of the considerable variety of fields in which GANs are applied [3], the variety of different GAN architectures is wide [211,67,5]. This research focuses on outlining the fields where GANs have achieved better results. We will review the different GAN architectures that exist, how they are structured, and how they are adapted to fulfill the particularities of each problem.
Although we will explain different GAN architectures, it should be noted that, when new GAN models are created, they usually combine the results of previous research. Most of the models that we will present overlap with one another to achieve better results.
GAN surveys usually focus either on GAN model structure [185,48] or on their application to certain tasks [180,4]. Because we will focus on novel GAN architectures, this survey can be identified as being of the first type. Nevertheless, in the final part of this survey, we will review how different GAN architectures are applied to real-world problems.
This survey focuses on contextualizing the recent progress in the GAN field, reviewing the different variants that have been presented lately and how they address the main problems of training GANs. We provide a complete view of the GAN structure and its particularities, and then contextualize the main problems that these networks suffer. We also summarize how GAN performance is measured, explaining the metrics that researchers use most. Throughout the different sections we outline how the presented architectures treat the different problems that we have characterized. Finally, we propose a classification of GANs based on their application; for each class we review the progress that the main variants have followed and we compare their results.

Related Work
Several other surveys of GANs published during the last years [134,177,48,152,187] have been studied to investigate the recent trends. For example, [185] focuses on the instability issues that GANs suffer and shows different ways to minimize them. The results suggest that some novel architectures try to control the GAN's training, while this control can also be achieved by tuning hyperparameters. It also emphasizes that much of the theoretical work does not hold in practice, which causes some GANs to converge when they should not and not to converge when they should.
A few surveys have been conducted to explore approaches to optimizing the loss function of GANs. This line of research tries to enhance the similarity between the original and synthesized data distributions by defining an appropriate loss function. Surveys such as [133] focus on analyzing state-of-the-art GANs and further analyzing the performance of a large variety of networks. In addition, they propose a set of recommendations on which loss function works best for each use case.
Other works focus on the applications of GANs instead of their composition or loss function. For example, [53] focuses on how different GAN architectures have been used during the last years for different problems, while [180] shows the different architectures for computer vision and their applications.
Due to the constant evolution of GANs during the last few years, these reviews become outdated almost instantaneously. As a result, some relevant and recent works like [80,200,106] cannot be found in any recent GAN review [3,34]. We consider that a new and more complete review must be done, covering the research that previous reviews did not cover and contributing a deeper and more thorough analysis of the state of the art of GANs.

Structure of this survey
This survey is structured as follows. Section 4 is a concise introduction to GAN composition and principles; we also summarize the common problems that GANs suffer and then review the different solutions proposed for each problem. The different evaluation metrics are also reviewed, addressing each metric's strengths and weaknesses.

Generative Adversarial Networks (GANs)
In this section, we will review the basic characteristics of GANs: their structure, composition, and common problems. We will especially focus on GAN problems because most GAN architectures [119,160] are created to minimize these training problems.

Definition and structure
GANs are an architecture composed of several neural networks whose objective is to replicate a data distribution in an unsupervised way. To achieve this, they are composed of two neural networks that play a two-player zero-sum game. In this game, the network called the Generator (G) is in charge of creating new data samples that replicate, but do not copy, the origin data distribution, while the Discriminator (D) tries to distinguish real from generated data.
From a formal point of view, D estimates p(y|x), that is, the probability of a label y given the sample x; while G generates a sample given a latent space z, which can be denoted as G(z).
This process consists in both networks competing. While G tries to generate more realistic results, D improves its accuracy at detecting which samples are real and which are not. In this process, both competitors are synchronized: if G creates a better output, it will be more difficult for D to differentiate it from real data. On the other hand, if D is more precise, it will be more difficult for G to fool D. This process is a minimax game in which D tries to maximize its accuracy and G tries to minimize it. The loss function of the minimax game can be denoted as:

min_G max_D V(D, G) = E_{x∼pr}[log D(x)] + E_{z∼pz}[log(1 − D(G(z)))]

where x ∼ pr is the distribution of the real data and z ∼ pz denotes the probability distribution of the latent space of G. pz is commonly a Gaussian or uniform noise distribution from which G models new samples of data, denoted as G(z). The function of D is to differentiate between the real samples, scored as D(x), and the synthesized samples, scored as D(G(z)). The initial publication where GANs were presented [51] proved the existence of a unique solution to this equation. This solution is called the Nash Equilibrium (NE) and it happens when neither player can improve their loss [125].
Several works have shown that reaching the NE might not be possible in practice [43,61], or that the solution might not be unique [49].
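As an illustration, the value function V(D, G) can be estimated by Monte Carlo from discriminator outputs. The following sketch (our own, not taken from any of the surveyed papers) evaluates it at the equilibrium point, where D(x) = D(G(z)) = 0.5 for every sample:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, values in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, values in (0, 1).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# At the Nash equilibrium D outputs 0.5 everywhere and V = -log 4:
print(gan_value([0.5, 0.5], [0.5, 0.5]))  # ≈ -1.386 (= -log 4)
```

A perfectly accurate D (D(x) → 1, D(G(z)) → 0) pushes this value towards 0, while the theoretical optimum of the game sits at −log 4.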

Common problems
Due to the GAN particularities previously described, there are some aspects of GAN training [150] to which special attention should be given.
In addition to summarizing the main GAN problems, in Section 5 we will connect the different GAN architectures with the problems that they tackle. It should be noted that recently proposed architectures try to minimize the different GAN issues to optimize their models.

Mode collapse
The objective is to generate synthesized data from a latent space, which requires not only quality in the generated data, but also generalization and diversity in the different synthesized samples. In other words, GAN models should be able to create new, unseen data. Mode collapse occurs when outputs of the same class are generated for different inputs from the latent space [207].
There are studies [2] that show how the quality and diversity of GANs are correlated. Many efforts [120,7,96] have been made to tackle mode collapse, but it is still an open problem.

In practice, it is not common for a GAN model to always generate the same output for different inputs [50]; this issue is known as complete mode collapse. This type of error occurs rarely; however, it is common for the problem to occur in a partial form, where G covers only a subset of the modes of the data distribution, generating diversity within those modes but ignoring the rest.

Gradient vanishing
GAN training must be balanced; both G and D need to be synchronized to learn together progressively [159,207]. A very accurate D is capable of differentiating between real and synthesized data, which can be denoted as D(x) = 1 and D(G(z)) = 0.

In this case the loss function approaches zero, generating gradients close to zero and providing little feedback to G. On the other hand, a poorly accurate D cannot differentiate between real and synthesized data, providing G with useless information.
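This saturation can be illustrated numerically. The following minimal sketch (ours, not from any cited paper) compares the gradient of the original saturating generator loss log(1 − D(G(z))) with the commonly used non-saturating alternative −log D(G(z)), both taken with respect to the discriminator logit a, where D(G(z)) = sigmoid(a):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a): vanishes when D rejects fakes
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da (-log sigmoid(a)) = sigmoid(a) - 1: stays near -1 in that regime
    return sigmoid(a) - 1.0

for a in (0.0, -5.0, -10.0):  # increasingly confident rejections by D
    print(a, grad_saturating(a), grad_non_saturating(a))
```

When D confidently rejects the generated samples (a ≪ 0), the saturating gradient collapses towards zero while the non-saturating one does not, which is why the latter is preferred in practice.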

Instability
Due to the particularity of GANs, the combination of two models learning from each other is a complex task. GAN training is based on a zero-sum game where both networks compete to find their particular solution, playing a minimax game.

This combination of models must jointly optimize the global loss function, but the problems that D and G must optimize are opposed. Due to the particularity of the objective functions of the networks, there can be times during training when a small change in one of the networks leads to a big change in the other, in turn producing further changes. Those intervals in which both networks start to desynchronize are very delicate, since large changes in the gradients can lead to a network losing what it has learned [5,213].

It should be noted that instability periods tend to generate more instability, making the problem last longer. The networks can reverse the instability process, but even if that happens, it comes at a cost in training performance.

Many of the recently proposed GAN architectures focus on stabilizing their training [77,5]. By stabilizing the training, a better performance of the networks is usually achieved, which is why most of the latest progress involves more stable training.

Stopping problem
Traditional neural networks optimize a loss function that, in theory, decreases monotonically. Due to the minimax game that GANs have to optimize, this does not happen for them [49,107,10]. In a GAN training, the loss function does not follow any pattern, so it is not possible to know the state of the networks from their loss function. This means that, while training is occurring, it is not possible to know when the models have been fully optimized.

Evaluation metrics
Due to the particularity of GANs, there is no unique metric to measure the quality of the synthesized data [190]. One of the reasons why there is no consensus among researchers is the particularity of each GAN application. As mentioned in previous sections, GANs can be used to replicate any data distribution, but how to measure the differences between the original and synthesized distributions depends on the particular problem [17].
As there is no unique universal metric to measure the performance of these kinds of models, different metrics have been developed during the last years. Each metric has its particular strengths and it should be noted that, in practice, different metrics are used and compared to measure different aspects and to obtain a wider view of GAN performance [50].
Since there is not an evaluation metric that fulfills all GAN possible applications, we will review the most widely used metrics:

Inception Score (IS) and its variants
IS [150] measures the quality and diversity of the generated samples of a GAN.
To do so, it uses a pretrained neural network classifier called Inception v3 [165]. The model is pretrained on a dataset of real-world images called ImageNet [36] and can differentiate between 1,000 classes of images.
The IS is calculated from the predicted class probabilities of the generated samples. A sample that is strongly classified as one specific class is considered to have high quality. In other words, it is assumed that low entropy and high-quality data are correlated. The IS value varies between 1 and the number of classes of the classifier.
One of the main problems of the IS is that it cannot handle mode collapse. In this case, all samples generated by the GAN will be practically the same, but the IS can still be very high if the images are strongly classified as one class. If this happens, the IS could be high while the real situation is very poor.
Another particularity of this metric is that it is designed to measure the quality of images, since it uses an image classifier.
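As a sketch of the computation just described (the function name and input layout are ours), the IS can be estimated from the matrix of class probabilities p(y|x) produced by the Inception classifier, as exp(E_x[KL(p(y|x) || p(y))]):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) array of class probabilities p(y|x) for N generated samples.
    Returns exp( E_x[ KL( p(y|x) || p(y) ) ] ), between 1 and C."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # p(y), the marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident *and* diverse predictions over 2 classes give the maximum score:
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))  # ≈ 2.0
```

Note how mode collapse fools this estimate: if every sample is confidently assigned to the same single class, p(y|x) equals p(y) and the score drops to 1 only when diversity is measured across samples; a collapsed generator whose outputs are all strongly classified as one class still needs the marginal term to reveal the problem.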
Based on IS, some modifications to the metric exist. For example, the Mode Score (MS) [129] is an evaluation metric that takes into account the prior distribution of the labels over the data, i.e., it is designed to reflect the quality and diversity of the synthesized data simultaneously.
Another modification of IS is the modified Inception Score (m-IS) [58]. It measures the diversity within the outputs of the same class category, trying to mitigate the mode collapse problem.
Other metrics compare feature statistics directly. The Fréchet Inception Distance (FID) [61] calculates the mean and covariance of the embeddings of the synthesized and real images and then measures the distance between the two distributions. The distance is measured using the Fréchet distance, also known as the Wasserstein-2 distance. The FID is calculated as follows:

FID = ||μ_r − μ_w||² + Tr(Σ_r + Σ_w − 2(Σ_r Σ_w)^{1/2})

where r denotes the real data, w denotes the synthesized data of G, and (μ_r, Σ_r) and (μ_w, Σ_w) are the corresponding means and covariances of the embeddings.
The FID is the most commonly used metric to measure the quality of generated images [79,77,78,33]. Using a shared metric across different architectures allows their results to be compared directly. In further sections we will go through different results, comparing them using FID.

One of the strengths of this metric is that it is sensitive to contamination such as Gaussian noise, Gaussian blur, black rectangles and swirls, among others.
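Given the formula above, the FID can be computed from the embedding statistics alone. A minimal NumPy sketch (ours; it uses the identity Tr((Σ_r Σ_w)^{1/2}) = Tr((Σ_r^{1/2} Σ_w Σ_r^{1/2})^{1/2}) so that only symmetric matrices need a square root) is:

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def fid(mu_r, cov_r, mu_w, cov_w):
    """Fréchet distance between Gaussians (mu_r, cov_r) and (mu_w, cov_w)."""
    s = sqrtm_psd(cov_r)
    covmean = sqrtm_psd(s @ cov_w @ s)  # same trace as (cov_r cov_w)^(1/2)
    diff = np.asarray(mu_r, dtype=float) - np.asarray(mu_w, dtype=float)
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_w)
                 - 2.0 * np.trace(covmean))

# Identical statistics give a distance of zero:
print(fid([0.0, 0.0], np.eye(2), [0.0, 0.0], np.eye(2)))  # ≈ 0 (float error)
```

In real evaluations the statistics are computed over Inception embeddings of tens of thousands of images; this sketch only shows the distance computation itself.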

Multi-scale structural similarity for image quality (MS-SSIM)
MS-SSIM is based on the comparison of two image structures, luminance and contrast at different scales [181]. The MS-SSIM provides a metric that compares the similarity between the real and the synthesized dataset. One of the strengths of MS-SSIM is that it models the strong dependence between nearby pixels. In comparison with other metrics such as the Mean Squared Error (MSE), which calculates the absolute error of an image, MS-SSIM provides a metric based on the geometry and structure of the image.
The MS-SSIM scale is based on the Structural Similarity Index Measure (SSIM), which is calculated as follows:

SSIM(x, y) = l(x, y)^α · c(x, y)^β · s(x, y)^γ

where x and y are two image windows of common size, l is the luminance comparison, c the contrast comparison and s the structure comparison, and α, β and γ weight each term. The value of SSIM is a decimal between 0 and 1, where 1 represents two identical sets of data. Therefore, it is assumed that the higher the value of SSIM, the higher the quality of the synthesized images.
MS-SSIM is calculated as the average pairwise SSIM over N batches. This metric is commonly used together with IS or its variations [87] to provide a wider view of the quality of the generated data.
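The single-scale SSIM between two windows can be sketched as follows. This is our own simplification using the standard combined form with the usual stabilizing constants for 8-bit images; real MS-SSIM implementations additionally use Gaussian windows and multiple scales:

```python
import numpy as np

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-scale SSIM between two equally sized image windows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()          # luminance statistics
    vx, vy = x.var(), y.var()            # contrast statistics
    cov = ((x - mx) * (y - my)).mean()   # structure statistic
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

img = np.arange(64.0).reshape(8, 8)
print(ssim(img, img))  # → 1.0 for identical windows
```

Averaging this quantity over sliding windows, and then over scales, yields the MS-SSIM used to compare real and synthesized datasets.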

Classifier Two-sample Test (C2ST)
To measure the quality of the generated distribution, a binary classifier can be used [93]. The classifier divides the samples into synthesized and real ones, judging whether different samples belong to the same data distribution.
It should be noted that this method is not constrained to image evaluation: since a classifier can be used on any given data distribution, it can be adapted to any type of input data. Neural networks can be used as a C2ST; as mentioned in previous sections, D is indeed a classifier of real and generated data. As proposed in [109], a C2ST can be applied to GANs by using the same composition as the discriminator, in the paper's words "training a fresh discriminator on a fresh set of data", which yields the C2ST-Neural Network (C2ST-NN).
Using C2ST, we can measure the distance between the synthesized and real data distributions. This provides a useful, human-interpretable metric of GAN performance. C2ST has been applied to different GAN architectures such as DCGAN or CGAN, using C2ST-NN and C2ST-1-NN [109].
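A minimal sketch of the 1-nearest-neighbour variant (in the spirit of C2ST-1-NN; the implementation details here are ours, not those of [109]) classifies each sample by the label of its nearest neighbour, leave-one-out. Accuracy near 0.5 suggests the two samples come from similar distributions, while accuracy near 1.0 means they are easily separable:

```python
import numpy as np

def c2st_1nn(real, fake):
    """Leave-one-out 1-NN two-sample test accuracy on two (N, d) sample sets."""
    X = np.vstack([np.atleast_2d(real), np.atleast_2d(fake)]).astype(float)
    y = np.array([1] * len(real) + [0] * len(fake))
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                          # exclude each point itself
    pred = y[d.argmin(axis=1)]                           # nearest neighbour's label
    return float((pred == y).mean())

# Well-separated samples are trivially distinguishable:
real = np.zeros((5, 2))
fake = np.full((5, 2), 10.0)
print(c2st_1nn(real, fake))  # 1.0
```

A well-trained GAN should push this accuracy down towards 0.5, the chance level of the binary classifier.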

Perceptual path length
Using the well-known neural network classifier VGG16 [155], the perceptual path length was designed [79] to measure the entanglement of the latent space, quantifying how much the generated image changes perceptually when taking small interpolation steps in the latent space.

Maximum Mean Discrepancy (MMD)
MMD is used to measure the distance between two distributions [18]. A lower MMD score means that the distributions being compared are closer, which means that the synthesized data is similar to the original.
Given distributions P and Q and a kernel k, MMD can be denoted, as defined in [94], as:

MMD²(P, Q) = E_{x,x′∼P}[k(x, x′)] − 2 E_{x∼P, y∼Q}[k(x, y)] + E_{y,y′∼Q}[k(y, y′)]

It should be noted that this method can be used with any type of data.
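The estimator can be sketched from samples with an RBF kernel (a common choice; the kernel and its bandwidth here are our assumption, not a prescription from [94]):

```python
import numpy as np

def mmd2(X, Y, gamma=0.5):
    """Biased estimate of squared MMD with k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    X = np.atleast_2d(X).astype(float)
    Y = np.atleast_2d(Y).astype(float)
    return float(gram(X, X).mean() - 2.0 * gram(X, Y).mean() + gram(Y, Y).mean())

# Identical samples give MMD^2 = 0:
print(mmd2(np.arange(6.0).reshape(3, 2), np.arange(6.0).reshape(3, 2)))  # 0.0
```

Any positive-definite kernel works, which is what makes MMD applicable to arbitrary data types, not only images.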

Human rank (HR)
Human classification can be useful in some cases. Either to complement other evaluation metrics, or because no other metric fits the particular problem, human evaluation of the generated data can be performed.
Due to the particularity of this method, it can only be used when the synthesized data is comprehensible to a human.
For example, in [211,67] human classifications were applied via Amazon Mechanical Turk (AMT) to evaluate the realism of the outputs of the GAN. In this case, participants had to differentiate between generated and real images. The more images that fool human perception, the better.
This method can provide an approximation of how GANs creation would be perceived by humans.

GAN variants
Since the first GAN was developed [51], many different variations of it have been published [79,67,77,5]. To obtain a broad vision of recent GAN research, we will review the recent progress in this field.
This section groups GAN models according to their main features. That said, we will divide the different GAN variations into those based on architecture modifications and those based on loss function modifications.

Architecture optimization
Some recent works [79,77,211] focus on how the architecture of the GAN is designed. Some of them [77] suggest a change in GAN training, while others [79] add changes to the structure of the G or D models.

We will review the traditional GAN architecture, focusing on models that are relevant for recent GAN development. It should be noted that the collection of architectures that we will review should not be considered in isolation.
GAN model evolution is supported by constant optimization. Therefore, to have a complete vision of GAN evolution, we will go through the different models that have been relevant in the last years.

Deep Convolutional GAN (DCGAN)
One year after the first GAN was proposed in 2014 [51], the DCGAN was introduced [142], suggesting some changes to the original architecture. The main objective of the DCGAN is to use convolutional layers instead of the originally proposed fully connected layers.
The main change with respect to fully connected GANs is the substitution of the dense layers by convolutional layers. Convolutional layers have been used during the last decade for computer vision tasks. By applying different filters to the images, the convolutional layers are able to extract the main features of the pixel matrix while keeping the correlation between adjacent pixels.
Convolutional layers are used not only for image processing; there are recent projects [72] that arrange data as matrices to take advantage of convolutional layers.
In addition to the convolutional layers, other changes were suggested to stabilize GAN training. Replacing the pooling layers with strided convolutions has shown better performance [157,6]. Therefore, it is proposed to use strided convolutions in both G and D.
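To make the downsampling role of strided convolutions concrete, a naive single-channel sketch (ours, not the DCGAN implementation) shows how a stride of 2 roughly halves the spatial resolution without any pooling layer:

```python
import numpy as np

def strided_conv2d(x, w, stride=2):
    """Naive 'valid' 2-D convolution with stride on a single-channel input.
    In DCGAN-style networks this learned downsampling replaces pooling."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * w).sum()  # learned filter, not a fixed pooling op
    return out

x = np.random.rand(64, 64)
print(strided_conv2d(x, np.ones((4, 4)) / 16.0).shape)  # (31, 31)
```

Unlike max or average pooling, the filter weights here are trainable, so the network learns its own downsampling.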
The use of batch normalization layers in both G and D is also proposed, which has been shown to reduce noise and improve the diversity of the generated samples [99,186].
For the activations of the convolutional layers, it is proposed to use the Rectified Linear Unit (ReLU) for the hidden layers of G, the hyperbolic tangent (tanh) for the output layer of G, and the Leaky Rectified Linear Unit (LeakyReLU) for D.
In addition to the mentioned changes in the architecture of GANs, the DCGAN paper also presents a technique to visualize the filters learned by the models. This helps the comprehension of what GANs learn, confirming previous work related to biology [66].

This architecture marked a change in how GANs are designed and trained. The innovations proposed in the paper are applied in most of the following GAN models.

Conditional GAN (CGAN)
Proposed in 2014 [121], the CGAN architecture adds a class label c alongside the latent space. The new label is used to split the processed data into different classes; thus, the synthesized data is generated according to the class of the input label.
There are some problems that require the generated data to be classified into different classes [108,113,97].
Despite being a simple technique, it has been shown to help prevent mode collapse. However, the training of a CGAN requires a labeled dataset, complicating its application to some problems.
The CGAN architecture has influenced GAN models since its proposal, and many variations of it have been developed [67,25,130].
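The conditioning mechanism can be sketched as a simple concatenation of the latent vector with a one-hot encoding of the label (the exact way c is injected varies between implementations; this concatenation is an illustrative assumption):

```python
import numpy as np

def conditional_input(z, c, num_classes):
    """CGAN-style conditioning sketch: append a one-hot label to the latent z.
    The generator then receives class information alongside the noise."""
    one_hot = np.zeros(num_classes)
    one_hot[c] = 1.0
    return np.concatenate([z, one_hot])

z = np.random.randn(100)
print(conditional_input(z, 3, 10).shape)  # (110,)
```

The discriminator of a CGAN is conditioned in the same way, receiving the label together with the real or generated sample.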

Auxiliary Classifier GAN (ACGAN)
ACGAN [130] modifies the CGAN structure. The D of the ACGAN does not receive the class label c as an input; instead, D is used to predict the probability of the image class. To train the model, the loss function must be modified, dividing the objective function into two parts, one for the correct source of the data and the other for the class label. The ACGAN loss function can be denoted as:

L_s = E[log P(S = real | X_real)] + E[log P(S = fake | X_fake)]
L_c = E[log P(C = c | X_real)] + E[log P(C = c | X_fake)]

where L_s is the log-likelihood of the correct data source and L_c is the log-likelihood of the correct class label. D is trained to maximize L_s + L_c, while G is trained to maximize L_c − L_s.

Interpretable Representation Learning by Information Maximizing GANs (InfoGAN)
One of the mentioned deficiencies of conditional GANs was the requirement of a labeled dataset. InfoGAN [25] provides an architecture to train conditional GANs with an unsupervised method. To do so, the latent class label c is substituted by a latent code vector.
The mutual information between the latent code and the generated data is maximized [154]. The mutual information term is not easy to calculate because it requires the posterior P(c|x). To optimize the training performance, an auxiliary distribution Q(c|x) is defined. With this, the loss function of the InfoGAN is defined as follows:

min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) − λ L_I(G, Q)

where V(D, G) is the original minimax GAN loss, L_I(G, Q) is the variational lower bound of the mutual information between the latent code and the generated samples, and λ is a hyperparameter in charge of controlling the latent code term. As proposed in the original paper [25], λ equal to 1 is used when the latent code is discrete; for continuous latent codes a smaller λ should be used. The reason for this is to control the differential entropy.

Image-to-Image Translation with Conditional Adversarial Nets (Pix2Pix)
The main objective of the Pix2Pix [67] architecture is to perform image-to-image translation. That is, given an image from a domain A, transform it into another domain B; for example, given a map of a street, transform the map into an aerial photo of the same street.
The Pix2Pix architecture is based on an encoder-decoder with skip connections between some layers.
This architecture is known as U-Net, and it is based on the idea of retrieving information from the early stages of the network. The same approach of skipping connections has been used before [60,164,210,71], showing great results and improving network performance.
In addition to the new architecture, a new loss function is proposed, combining the conditional adversarial loss with an L1 distance between the generated image and the paired target image:

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G)

where L_L1(G) = E_{x,y,z}[||y − G(x, z)||_1]. As a follow-up of Pix2Pix, Pix2PixHD was proposed [105], improving the quality of the generated images. Many later works have used Pix2Pix [141,124,132,41], converting it into one of the most popular architectures of the last decade.
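A toy sketch of the combined generator objective (our simplification: a non-saturating adversarial term stands in for the full conditional loss, and λ = 100 follows the common default):

```python
import numpy as np

def pix2pix_g_loss(d_fake, fake, target, lam=100.0):
    """Sketch of the Pix2Pix generator objective: adversarial term plus
    lambda-weighted L1 distance to the paired target image.
    d_fake holds discriminator outputs on the generated images, in (0, 1)."""
    adv = -np.mean(np.log(np.asarray(d_fake, dtype=float) + 1e-12))
    l1 = np.mean(np.abs(np.asarray(target, dtype=float)
                        - np.asarray(fake, dtype=float)))
    return float(adv + lam * l1)
```

The large λ makes the L1 term dominate, which keeps the output close to the target while the adversarial term pushes it towards the manifold of realistic images.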
The immediate application of these algorithms to images has had a great impact on society, radically increasing their popularity thanks to the applications developed.

Cycle-Consistent GAN (CycleGAN)
Cyclic consistency is the idea that, given data x from a domain A, if the data is translated to a domain B and then translated back to domain A, the data x should be recovered. In other words, if a sample is translated to a domain and recovered from that domain, it should not change. This process, where a data sample is transformed and recovered, is known as cycle consistency, and it has been widely used during the last decades [162,75].
This idea is the main basis of CycleGAN [211]. The main strength of the application of cycles is that paired data is not a requirement. The GAN architecture adds a new mapping denoted as F, whose function is to perform the inverse mapping to retrieve the original data; in other words, F(G(x)) = x. To train the architecture, a new cycle consistency loss is proposed to train the so-called forward and backward cycle consistency. The cycle consistency loss is denoted as follows:

L_cyc(G, F) = E_{x∼p(x)}[||F(G(x)) − x||_1] + E_{y∼p(y)}[||G(F(y)) − y||_1]

Although CycleGAN was first proposed for image-to-image translation, it can be used for any data translation.
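The loss above can be sketched directly, with toy invertible mappings standing in for the two generators (names and setup are ours, purely for illustration):

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle-consistency: forward cycle F(G(x)) ≈ x, backward G(F(y)) ≈ y.
    G and F are callables mapping between the two domains."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return float(forward + backward)

# With perfectly inverse mappings the loss vanishes:
G = lambda a: a * 2.0
F = lambda a: a / 2.0
print(cycle_consistency_loss(np.ones(4), np.ones(4), G, F))  # 0.0
```

In the full CycleGAN objective this term is added, with a weighting factor, to the adversarial losses of both translation directions.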

Unsupervised Dual Learning for Image-to-Image Translation (DualGAN)
The architecture of DualGAN [196] is very similar to that of CycleGAN. As with CycleGAN, DualGAN does not require paired data to train its models. To learn the translation from one data domain to another, DualGAN has two pairs of identical G and D, with each pair responsible for its respective translation.
To stabilize the training and prevent mode collapse, the loss format of WGAN [5] is used. This conditions the architecture of the network and the construction of the objective function.
In order to train each pair of G and D, a reconstruction error term is defined. The objective of the reconstruction error is the same as in CycleGAN: calculating the distance between the original data sample and its corresponding recovered sample.
The generator loss with the reconstruction error is defined as:

l_g(u, v) = λ_U ||u − G_B(G_A(u, z), z′)|| + λ_V ||v − G_A(G_B(v, z′), z)|| − D_B(G_A(u, z)) − D_A(G_B(v, z′))

where U and V are the two domains, λ_U and λ_V are two constant parameters and z and z′ are random noises. λ_U and λ_V are normally set to a value within [100.0, 1000.0]; when the domain U contains real images (e.g., a human face photo) and V does not (e.g., a sketch of a human face), it is better to use a smaller value for λ_U than for λ_V.
DualGAN has been widely used and modified [194,138,100]. For example, in [175] a DualGAN architecture was used to transform the emotion of an input speech signal. In this application, given the Fundamental Frequency (F0) of a certain emotion, the trained network is capable of changing the emotion of the sound. To do so, F0 is encoded using wavelet kernel learning [195], following the same methodology as [111].

Learning to Discover Cross-Domain Relations with GANs (DiscoGAN)
DiscoGAN [81] is an architecture that follows the same structure as DualGAN and CycleGAN. The particularity of DiscoGAN is the usage of an autoencoder as the G. For D, it uses a classifier based on the encoder of the G.
Autoencoders have been used for other reconstruction problems [24,110,118].

Progressive Growing of GANs (ProGAN)
ProGAN [77] trains both networks progressively: training starts generating images at a low resolution and, as training advances, new layers are added to G and D so that the resolution of the generated images grows step by step. A scheme of the progressive training of ProGAN can be seen in Fig.3.
Due to the explained training methodology, ProGAN is capable of stabilizing the training of GANs, addressing one of the most important GAN problems. Building on this idea, a dynamic growing algorithm was proposed in which each step chooses among different growing possibilities: grow G with a certain convolution layer, grow D with a certain convolution layer, or grow both G and D to a higher resolution. A scheme of the training methodology can be seen in Fig.4. If all children were preserved in each step, it would produce an exponential growth that would lead to large inefficiency. To avoid that, before the children generation, a prune is made. Known as greedy pruning, the prune is done by keeping the top K children of each generation. Then each child becomes a parent and generates a new batch of children. The process repeats until the network grows to the desired size.
In the original research, the child search was made by combining different kernel sizes and numbers of filters; each parameter is known as an action, and the total number of actions is denoted as T. It can easily be noted that different hyperparameters can be searched with this algorithm. To avoid a large increment in the number of children, the algorithm proposes a probability p of a child testing a new parameter. Higher K, T and p mean a wider search, contributing to a better exploration of the candidates but slower training.
It should be noted that the search algorithm reduces the efficiency of the architecture, since multiple trainings must be performed simultaneously. It also limits the ability to grow, due to the rapid growth of the number of networks.

A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN)
StyleGAN [79] is based on the idea that improving the processing of the latent space will improve the quality of the generated data. Due to the particularities of the latent space, there are many interpolations of the variables [149,90] that produce entanglement in the learned characteristics of G. The architecture of StyleGAN is based on previous style transfer research [65].

With the StyleGAN architecture, G is capable of learning different styles of the input data, disentangling high-level characteristics. This produces an improvement in the quality of the generated data and helps in the interpretation of the latent space, previously poorly understood. Controlling the latent space leads to better interpolation properties, enabling interpolation operations at different scales, e.g., interpolation of poses, hair or freckles in human face images.
In the StyleGAN architecture, the input of G is mapped to an intermediate latent space W through a non-linear mapping network, and the resulting style codes control the synthesis network at every resolution.

Alias-Free GAN
During the last years, multiple architectures have been improving the quality of the synthesized images. The previously mentioned StyleGAN achieved some of the best results in image generation, producing images of human faces with a quality never seen before. Despite its good results, some problems remain open.
One of the most visible problems of the images generated by StyleGAN is known as texture sticking. It happens when a certain image feature depends on absolute coordinates instead of depending on the localization of other features. The problem is especially noticeable when interpolating images, e.g. when changing the posture of a human face the texture of the beard seems stuck to fixed positions.
Alias-Free GAN [78] focuses on solving the texture sticking problem of StyleGAN. The main idea is to suppress the aliasing in the generated images so that the finer details are attached to the underlying surface of the image. To achieve this, each layer of G is designed to be equivariant to rotations and translations of the continuous input.
To achieve an equivariant G, many changes were made. A 10-pixel margin is used for the internal representations, due to the assumption of infinite spatial extension of the feature maps. The Leaky ReLU layers are wrapped between an upsampling and a downsampling, which is implemented with a custom CUDA kernel for optimization. The filter cutoff frequency of StyleGAN is lowered to ensure the alias frequencies lie in the stopband. In addition, the learned input constant of StyleGAN is substituted by Fourier features [166,192]. Finally, the rotation equivariant version of the network is obtained by reducing the kernel size of the 3 × 3 convolutions to 1 × 1 and changing the sinc-based downsampling to a radially symmetric jinc-based one.

Self Attention GAN (SAGAN)
SAGAN [199] uses self-attention layers [174], which are capable of capturing structural and geometric features of multiclass datasets. The feature maps of each convolution are projected through 1 × 1 convolutions into query, key and value maps, which are then multiplied to construct the output of the layer. This way the network can learn long-range dependencies. The structure of the self-attention layer can be seen in Fig.5.
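The query/key/value mechanism can be sketched in numpy. This is a simplified sketch, not the full SAGAN layer: feature maps are flattened to an (N, C) matrix of N spatial locations, so the 1 × 1 convolutions reduce to per-location matrix products, and the residual connection and learned scale of SAGAN are omitted.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Self-attention over feature-map locations: project each location
    into query, key and value, compute pairwise affinities, and mix the
    values so every location can attend to every other one."""
    q, k, v = x @ wq, x @ wk, x @ wv            # (N, d) projections
    scores = q @ k.T                            # (N, N) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True) # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ v                             # each location mixes all others

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                    # 16 locations, 8 channels
out = self_attention(x, rng.normal(size=(8, 4)),
                        rng.normal(size=(8, 4)),
                        rng.normal(size=(8, 4)))
assert out.shape == (16, 4)
```

Because the attention matrix relates all pairs of locations, the output at one position can depend on distant positions, which is the long-range dependency property mentioned above.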

BigGAN
The BigGAN architecture [19] focuses on generating high resolution images from diverse datasets. While previous models were capable of synthesizing new samples of low dimensionality, they had problems when scaling their results to bigger samples. The results achieved by BigGAN, in terms of FID and IS, outperform previous models.
The authors of BigGAN claim that GANs perform better when they are scaled up. The architecture of BigGAN is based on the SAGAN [199] architecture. The authors show that, by enlarging the number of channels of the networks by 50%, the IS improves by 21%.
One innovation proposed in this article is the so-called "Truncation Trick". Previous GAN models used a normal or uniform distribution to generate the latent space of the G network. The authors show that by using a truncated normal distribution the results, in terms of FID and IS, are better. This truncation trick reduces the variety of values of the latent space by truncating them towards zero. The main drawback is that the variability of the generated samples is reduced: there exists a trade-off between the variety and fidelity of the generated samples. The more truncation is applied to the latent space, the less variety of images is produced.
Another aspect that is scaled up in this work is the batch size of the GAN training, which is increased by a factor of 8. The authors show that by using larger batches the gradients of each iteration are better, reaching a better performance in fewer steps. This is because the composition of each batch is more diverse, covering more modes of the data.

A GAN Through Quantum States (QuGAN)
During the last decade, quantum computing has become a hot topic in computer science. Since it was proposed in 1980 [14], it had always been restricted to a few laboratories around the world. Thanks to the progress made recently [114], it has become possible to test the first algorithms, prototypes and ideas [22].
Thanks to the particularities of quantum computing, previously defined problems can be solved, or optimized, reducing their computation time. Using quantum superposition, multiple solutions can be evaluated simultaneously; then, by using quantum interference and entanglement, the correct answer can be determined.
QuGAN [158] proposes a GAN architecture powered by quantum computing. By using quantum computing, GANs are hugely optimized, reducing their parameter set by 98.5% compared to traditional GANs.
QuGAN architectures use qubits to create the quantum layers of G and D, known as QuG and QuD. The data used by the networks is transformed into quantum states.
Benefiting from the entangling properties of quantum circuits, EQGAN guarantees convergence to a NE. The main particularity of EQGAN is that it performs quantum operations on both synthesized and real data; this approach produces fewer errors than swapping the data between quantum and classical representations. To apply EQGAN to real problems, a Quantum Random Access Memory (QRAM) is used; by using QRAMs, EQGAN is capable of improving the performance of the D.

Classification Enhancement GAN (CEGAN)
Data imbalance is a common problem when using real world datasets, which often contain a majority of samples of a certain data class. In the case of GANs trained on unbalanced datasets, the imbalance problem results in poor quality of the synthesized data for the classes with fewer samples.
CEGAN [160] tries to solve the data imbalance problem in GANs. The objective is to enhance the quality of the synthesized data and to improve the accuracy of the predictions.
The CEGAN architecture consists of 3 different networks: G, D and a new network known as the classifier (C). The training of CEGAN is divided into two steps. In the first step, the architecture is trained normally, using D to differentiate between fake and real samples, while C is used to classify the class label of the input sample. Then, in the second step, an augmented training dataset is formed by generating new samples from G, and this new dataset is used to train C.
The methodology presented in CEGAN substitutes previous techniques to deal with data imbalance, such as undersampling [126] or oversampling.

Mobile Image Enhancement GAN (MIEGAN)
The MIEGAN [135] presents a novel architecture that aims to improve the quality of images taken with a mobile phone. To do so, two new networks are proposed: the so-called multi-mode cascade generative network and the adaptive multi-scale discriminative network. The generative network follows an autoencoder architecture. The encoder of this new generator is divided into two streams; the second stream is in charge of improving the low luminance areas, where mobile phone cameras particularly lack clarity.
The discriminator network has a dual goal. First, a global discriminator ensures overall image quality. Second, a local discriminator maintains the local quality of small areas of the image. To combine both objectives, an adaptive weight allocation module is also proposed that is responsible for balancing the importance of each discriminator.
A brief scheme reviewing all the presented architecture-variant GANs can be seen in

Loss function optimization
Orthogonal to the architecture modification GANs, there are many researches [5,140,116] that focus on modifying the loss function of the networks rather than their structure.
In this section, we will review the most important and recent progress in variations of the loss function of GANs.

WGAN
The base of the WGAN [5] is the application of the Earth Mover (EM) distance, also known as Wasserstein-1 distance. The Wasserstein distance is defined as:

$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\left[\lVert x - y \rVert\right]$

where $\Pi(P_r, P_g)$ denotes the set of all joint distributions whose marginals are $P_r$ and $P_g$. In other words, the Wasserstein distance calculates the cost of transforming the distribution $P_r$ into the distribution $P_g$. In the case of GANs, the Wasserstein distance measures the difference between the real and synthesized data distributions.
In order to apply the new objective function, some changes must be applied to the architecture of GANs. D changes its objective: whereas previously D was used to distinguish which data was real and which was synthesized, in WGAN D changes its name to critic. The critic's function is to measure the realness of an image, i.e. a score of how close the image is to the real distribution. The weights of the critic are clipped to a fixed window (e.g. [-0.01, 0.01]) after each gradient update. The weight clipping is done to make the parameters lie in a compact space, as required by the new role of the critic network.
The EM distance has been shown to produce better gradient behavior than other distance metrics.
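A single critic update with weight clipping can be sketched for a toy linear critic. This is a hypothetical minimal example, not the WGAN training loop: the critic is f(x) = w · x, so the gradient of the objective E[f(real)] − E[f(fake)] with respect to w is simply the difference of the sample means.

```python
import numpy as np

def critic_step(w, real, fake, lr=0.05, clip=0.01):
    """One WGAN critic update for a toy linear critic f(x) = w @ x:
    gradient ascent on E[f(real)] - E[f(fake)], followed by clipping
    the weights to [-clip, clip]."""
    grad = real.mean(axis=0) - fake.mean(axis=0)   # d/dw of the objective
    w = w + lr * grad                              # gradient ascent step
    return np.clip(w, -clip, clip)                 # weight clipping

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(256, 4))
fake = rng.normal(loc=-1.0, size=(256, 4))
w = np.zeros(4)
for _ in range(10):
    w = critic_step(w, real, fake)
assert np.abs(w).max() <= 0.01   # parameters stay in a compact space
```

The clipping keeps the critic inside a compact parameter set regardless of the gradient magnitude, which is the crude Lipschitz enforcement that WGAN-GP later replaces.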

WGAN-GP
In the original paper of WGAN, the authors already suggest that weight clipping is "a terrible way to enforce Lipschitz constraints". Weight clipping was one of the problems of the original WGAN, but it worked well enough and its implementation was easy. The WGAN-GP [55] proposes a new technique to substitute the weight clipping that led the WGAN to undesired behavior.
The proposed change involves constraining the norm of the critic's gradient with respect to the input of the network. The constraint is softened via a penalty on the gradient norm; the new loss function is denoted as follows:

$L = \mathbb{E}_{\tilde{x}\sim P_g}[D(\tilde{x})] - \mathbb{E}_{x\sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x}\sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$

where $P_{\hat{x}}$ samples uniformly along straight lines between pairs of points from $P_r$ and $P_g$. The new change optimizes the training of WGAN-GP, stabilizing it with almost no hyperparameter tuning. The new loss function also improves the quality of the generated images over WGAN and converges faster.
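The penalty term can be sketched numerically for a toy linear critic, whose input gradient is the weight vector itself everywhere. This is a hypothetical illustration of the penalty only; a real implementation differentiates the critic with automatic differentiation.

```python
import numpy as np

def gradient_penalty(w, real, fake, lam=10.0, rng=None):
    """WGAN-GP penalty for a linear critic f(x) = w @ x. Points are sampled
    on straight lines between real and fake samples; for a linear critic the
    input gradient at any such point is w, so deviations of ||w|| from 1 are
    penalized."""
    rng = rng or np.random.default_rng(0)
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1 - eps) * fake   # interpolates (shown for clarity)
    grad_norm = np.linalg.norm(w)           # grad f(x_hat) = w for linear f
    return lam * (grad_norm - 1.0) ** 2

rng = np.random.default_rng(1)
real = rng.normal(size=(32, 3))
fake = rng.normal(size=(32, 3))
penalized = gradient_penalty(np.array([2.0, 0.0, 0.0]), real, fake)
assert penalized > 0   # gradient norm 2 is pushed back towards 1
assert gradient_penalty(np.array([1.0, 0.0, 0.0]), real, fake) == 0
```

Unlike clipping, the penalty is soft: the critic is free to use large weights as long as the resulting input gradient stays close to unit norm.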

Loss-Sensitive GAN (LS-GAN)
In order to measure the quality of the synthesized samples created by G, a new loss function is used in the LS-GAN [140]. The new loss function aims to use regularization theory to improve the performance of the GAN architecture. The main idea behind the new loss function is that a real sample should produce a smaller loss than a synthesized one, with a predefined margin between both. Once this assumption is set, we can infer that the training of G must aim at minimizing the loss margin between real and synthesized images. The proposed loss function is denoted as follows:

$\min_{\theta}\; \mathbb{E}_{x\sim P_r}\left[L_\theta(x)\right] + \lambda\, \mathbb{E}_{x\sim P_r,\, z\sim P_z}\left[\left(\Delta(x, G(z)) + L_\theta(x) - L_\theta(G(z))\right)_+\right]$

where $(a)_+ = \max(a, 0)$, $\Delta(\cdot,\cdot)$ is the predefined margin, λ is a hyperparameter for balancing and θ are the parameters of D.
The loss function is regularized via Lipschitz regularity condition over the density of the real data. Due to the regularization, the created models are better in generalization of new data.

Least Square GAN (LSGAN)
The new loss function presented in LSGAN [116] aims to reduce the vanishing gradient problem. The main objective of LSGAN is to penalize the synthesized samples that are far from the real data but still on the correct side of the decision boundary. The least squares loss functions are denoted as follows:

$\min_D V(D) = \frac{1}{2}\mathbb{E}_{x\sim p_{data}}\left[(D(x) - b)^2\right] + \frac{1}{2}\mathbb{E}_{z\sim p_z}\left[(D(G(z)) - a)^2\right]$

$\min_G V(G) = \frac{1}{2}\mathbb{E}_{z\sim p_z}\left[(D(G(z)) - c)^2\right]$

where a and b are the labels for fake and real data respectively and c is the label that G wants D to believe is real data. It should be noted that the square in both equations is responsible for penalizing samples far from the decision boundary.
The LSGAN generates stronger gradients by penalizing samples that lie a long way from the decision boundary. This way the gradients are forced to be higher, preventing the gradient vanishing problem. Compared to the classical sigmoid cross entropy loss function of GANs, the new least squares loss is flat only at one point, as can be seen in Fig.8.
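The least-squares losses above are straightforward to compute. In this minimal sketch `d_real` and `d_fake` stand in for D's raw scores on a batch, with the common choice a = 0, b = c = 1.

```python
import numpy as np

def lsgan_losses(d_real, d_fake, a=0.0, b=1.0, c=1.0):
    """Least-squares GAN losses: a and b label fake and real data and c is
    the label G wants D to assign to generated samples. The squared terms
    penalize samples far from the decision boundary even when they are
    already classified on the correct side."""
    d_loss = 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)
    g_loss = 0.5 * np.mean((d_fake - c) ** 2)
    return d_loss, g_loss

# A fake sample scored 0.4 sits on the "fake" side of the boundary but far
# from the target label c = 1, so it still yields a sizeable generator loss.
d_loss, g_loss = lsgan_losses(np.array([0.9]), np.array([0.4]))
assert g_loss > 0.1
```

Note that with a sigmoid cross-entropy loss a confidently classified fake sample yields a near-zero gradient, whereas here the quadratic term keeps pulling it towards the boundary.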

Unrolled GAN (UGAN)
The UGAN [119] loss function is defined to prevent instability in GANs training.
The idea behind UGAN is to dynamically adapt G and D to prevent situations of unbalance, where one of the networks is more trained than the other. Commonly, due to the nature of the problems to solve, the task of D is easier than that of G, producing an imbalance in favor of D.
The training of UGAN is dynamically changed; the presented loss is a surrogate for training G. The surrogate objective function is created by unrolling K steps of the D update for each update of G. Using the proposed loss function, the behavior of G adapts to the training state of D. The surrogate loss function is defined as follows:

$f_K(\theta_G, \theta_D) = f\left(\theta_G, \theta_D^K(\theta_G, \theta_D)\right)$

where $\theta_D^K$ is the result of K gradient ascent steps of D starting from its current parameters. With the application of the proposed loss function, UGAN demonstrates to stabilize the training by adjusting and synchronizing the G and D networks. Furthermore, it prevents mode collapse, avoiding the model dropping regions of the data distribution.
Despite this, the most important weakness of UGAN is its computational cost: optimizing the surrogate generator loss slows down the training considerably. How many unroll steps are needed to stabilize the training depends on the particular problem; in the original paper, for example, it varies between 1 and 10.
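The unrolling idea can be illustrated on a toy minimax game. This sketch uses finite differences instead of backpropagating through the unrolled steps as the paper does, and the game f(g, d) = g · d is an arbitrary stand-in for the real GAN objective.

```python
import numpy as np

def unrolled_g_grad(f, tg, td, k, lr=0.1, eps=1e-5):
    """Surrogate generator gradient: unroll k gradient-ascent steps of D
    from its current parameters, then differentiate f(tg, td_k) with
    respect to tg numerically."""
    def unroll(tg):
        d = td
        for _ in range(k):   # k inner ascent steps of D (finite-difference grad)
            d = d + lr * (f(tg, d + eps) - f(tg, d - eps)) / (2 * eps)
        return f(tg, d)
    return (unroll(tg + eps) - unroll(tg - eps)) / (2 * eps)

f = lambda g, d: g * d                          # toy bilinear minimax game
g0 = unrolled_g_grad(f, tg=1.0, td=0.5, k=0)    # k = 0: the standard gradient
g5 = unrolled_g_grad(f, tg=1.0, td=0.5, k=5)    # k > 0: anticipates D's reaction
assert abs(g0 - 0.5) < 1e-3
```

With k = 0 the surrogate collapses to the ordinary generator gradient; with k > 0 the gradient accounts for how D would respond to the update, which is what synchronizes the two networks. The cost is k extra D updates per generator step, which is the computational weakness noted above.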

Realness GAN
RealnessGAN substitutes the single scalar output of D with a distribution over realness, and its objective minimizes the divergence between the distribution predicted by D and two anchor distributions, where A0 and A1 are the anchor distributions for fake and real data respectively.
Using the new loss function, the RealnessGAN is capable of recovering more modes than a standard GAN, preventing mode collapse. Furthermore, RealnessGAN shows a better performance, generating higher quality images in both real-world and synthetic datasets.
One of the strengths of the RealnessGAN is its simple implementation, due to the fact that RealnessGAN is a generalization of the original GAN. In addition, despite being one of the most recently proposed architectures, it is expected to be widely used due to its good results and easy implementation.

Spectral Normalization GAN (SN-GAN)
SN-GAN [122] proposes a new technique to normalize the weights of the D network, searching for a more stable training through spectral normalization.
With respect to previous normalizations [151], spectral normalization is easier to implement. The previous methods imposed a much stronger constraint on the network matrices. With spectral normalization it is possible to relax this constraint, allowing the network to satisfy the local 1-Lipschitz constraint. The spectral normalization is defined as follows:

$\bar{W}_{SN}(W) = W / \sigma(W)$

where W is a weight matrix of D and σ(W) is its spectral norm, i.e. the largest singular value of W.
As mentioned before, the proposed normalization is very simple and its computational cost is small. It only requires the tuning of one hyperparameter, the Lipschitz constant.
The generated images using SN-GAN are more diverse, achieving a better comparative IS with respect to other weight normalizations.
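The normalization above can be sketched with power iteration, which is also how SN-GAN estimates σ(W) cheaply in practice (the paper reuses one iteration per training step; here we simply iterate until convergence).

```python
import numpy as np

def spectral_normalize(w, n_iters=200, rng=None):
    """Estimate the largest singular value sigma(W) with power iteration,
    then return W / sigma(W) so the layer's L2 Lipschitz constant is ~1."""
    rng = rng or np.random.default_rng(0)
    u = rng.normal(size=w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v          # Rayleigh-quotient estimate of sigma(W)
    return w / sigma

rng = np.random.default_rng(1)
w = rng.normal(size=(6, 4))
w_sn = spectral_normalize(w)
assert abs(np.linalg.norm(w_sn, 2) - 1.0) < 1e-3   # spectral norm is now 1
```

Dividing by σ(W) rescales only the largest singular value to 1 and leaves the ratios between singular values intact, which is the weaker constraint, compared to full weight normalization, that the text refers to.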

Cyclic-Synthesized GAN (CSGAN)
Previous works developed architectures for concrete domains of translation; CSGAN proposes a common framework for different domain translations.
The Cyclic-Synthesized Loss (CS) is proposed as the objective function of CSGAN.
The objective of the new loss is to evaluate the differences between a synthesized image and its corresponding cycled image. The proposed loss function is denoted as follows:

$L_{CS} = L_{CS_A} + L_{CS_B}$

where $L_{CS_A}$ and $L_{CS_B}$ are the Cyclic-Synthesized losses of both domains.
With respect to previous architectures, CSGAN produces images of better quality, notably reducing the artifacts of the synthesized images. The results show better performance of CSGAN in Chinese University of Hong Kong (CUHK) dataset [179] and comparable performance in FACADES dataset [171]. The comparison of the performance is made against GAN [51], Pix2Pix [67], DualGAN [196], CycleGAN [211] and Photo-Sketch Synthesis using Multi-Adversarial Networks (PS2MAN) [178].

Multi-IlluStrator Style GAN (MISS GAN)
The proposed architecture of MISS GAN [11] presents a single trained model capable of handling multiple illustrator styles. To train the MISS GAN models, five different objective functions are proposed.
The first loss function is called the adversarial objective (L_adv) and it is in charge of ensuring, given the input image and the target domain, that the generated image style corresponds to the target domain. To do so, L_adv takes two discriminator predictions, one for the input image and another for the synthesized image.
The second loss function is denoted as the style reconstruction objective (L_sty), and it enforces G to use the mapping network style code when receiving a generated latent code; L_sty is calculated from the output of the G encoder over the generated image.
The third proposed objective function is called the style diversification objective (L_ds) and it compares a pair of synthesized images, each corresponding to a different style code generated from a different latent code. The objective of this loss function is to force G to produce diverse images, preventing two images with different latent codes from being the same.
The fourth objective function is the cycle consistency loss (Lcyc) used in the Cy-cleGAN [211].
Finally, the fifth objective function is called content features loss (L content f eat ), and it computes the distance in the feature space by using a VGG16 [155] network.
To combine the different objective functions, a total objective is defined as follows:

$\min_{G,E,F} \max_{D}\; L_{adv} + \lambda_{sty} L_{sty} - \lambda_{ds} L_{ds} + \lambda_{cyc} L_{cyc} + \lambda_{feat} L_{content\,feat}$ (22)

where E is the style encoder, F is the mapping network, and the λ parameters are hyperparameters weighting each objective function.

Super Resolution GAN (SRGAN)
In order to apply GANs to image upscaling, the SRGAN [92] was proposed. The objective of the proposed GAN is to take an input natural image and upscale its resolution by a factor of 4.
To achieve the super resolution, the new variant combines an adversarial and a content loss into the so-called perceptual loss function, which is in charge of assessing the solution with respect to the relevant characteristics of the data:

$l^{SR} = l^{SR}_X + 10^{-3}\, l^{SR}_{Gen}$ (24)

where $l^{SR}_{Gen}$ is the adversarial loss and $l^{SR}_X$ is the content loss. The content loss relies on a pre-trained VGG-19 model [155], which, compared to a loss such as the MSE, is more invariant to changes in pixel space. This metric provides the network with information about the quality of the content of the synthesized image. The content loss is calculated as:

$l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left(\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\left(G_{\theta_G}(I^{LR})\right)_{x,y}\right)^2$ (25)

where $I^{LR}$ refers to the low resolution image, $I^{HR}$ to the high resolution image and $\phi_{i,j}$ to the feature maps of the VGG network.
In addition to the content loss, the adversarial loss is defined as the generative component of the GAN. This function is responsible for pushing the generated images to be realistic and indistinguishable from the real ones. The loss function is defined as:

$l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right)$

The application of SRGAN improves on the results of previous algorithms for image super resolution.
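The combination of the two losses can be sketched directly from equations (24) and (25). In this minimal sketch the arrays `feat_hr` and `feat_sr` stand in for the VGG feature maps φ of the high-resolution and the upscaled image, and `d_sr` for D's scores on generated images; no actual VGG network is involved.

```python
import numpy as np

def perceptual_loss(feat_hr, feat_sr, d_sr):
    """SRGAN-style perceptual loss: VGG-space MSE content loss plus the
    adversarial loss weighted by 1e-3, as in l_SR = l_X + 1e-3 * l_Gen."""
    content = np.mean((feat_hr - feat_sr) ** 2)   # l_X^SR (feature-space MSE)
    adversarial = np.mean(-np.log(d_sr))          # l_Gen^SR
    return content + 1e-3 * adversarial           # l^SR

rng = np.random.default_rng(0)
feat_hr = rng.normal(size=(8, 8, 16))
feat_sr = feat_hr + 0.1 * rng.normal(size=(8, 8, 16))
loss = perceptual_loss(feat_hr, feat_sr, d_sr=np.array([0.8, 0.9]))
assert loss > 0
```

The small 1e-3 weight keeps the adversarial term from dominating: the content loss anchors the reconstruction while the adversarial term only nudges textures towards realism.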
Since its introduction, SRGAN has been used in many different applications [203,35,209]. In addition, there are works such as [103] that present improvements on the SRGAN structure; the new architecture is known as Super Resolution Channel Attention GAN (srcaGAN). The architecture presented in this paper adds a channel attention module to the models, recovering the attention layer used in SAGAN [199]. The results of this new architecture outperform the SRGAN.

Weighted SRGAN (WSRGAN)
One of the characteristics of the SRGAN [92] was the combination of the content loss and the adversarial loss during training. The WSRGAN proposes changing the relative importance of each loss and studying the effect of this change.
The main objective of the WSRGAN is to improve the performance of the architecture by analyzing its performance in different combinations of its objective functions.
Then the new weighted loss function is defined through a parameter w that controls the impact of each loss function on the final result.
After training the network with different weight configurations, the paper concludes that the MSE loss is the most important loss function, supported by the VGG loss. Additionally, when the weight parameter is defined dynamically, even better results are obtained than when it is static.
A brief scheme reviewing the different presented loss function variant GANs can be seen in Fig.10. We divide the GANs into different groups based on the proposed changes in the loss function.

GAN timeline
A timeline with the reviewed architectures is presented in figure 11.

GAN applications
As mentioned before, GANs are one of the most popular applications of machine learning of recent years. GAN models can achieve results in fields where previous models could not; in other cases, GANs improve the previous results significantly.
In this section, we will review the most important fields where GAN architectures are applied, paying special attention to the GAN models related to computer vision tasks, and we will compare the results of the different architectures.
Most recent researches focus on how to apply GANs to generate new synthesized data, replicating a data distribution. But, as we will review in this section, GANs can also be applied to other fields, e.g. video game creation [80].

Image synthesis
One of the most important fields in which GANs are applied is computer vision. In particular, realistic image generation is the most widely used application of GANs [79,77,5].
Most of the proposed GAN variants are tested by generating real world images.
Arguably, image synthesis is the first application one might think of when thinking about GANs. Its popularity is due to the good results that GANs can achieve: compared with previous methods, GANs provide sharper results [47]. GANs have raised a lot of interest both in the academic world and among the general public.
One of the main reasons for the success of GANs is that their results are easy to understand. As the main output of GANs are images, they can be easily understood by anyone; even a person without any technical understanding of artificial intelligence can judge the results.
Within computer vision, image generation is the most used method to test GANs.
There are plenty of real world image datasets that can be used to train GANs. The availability of datasets suitable for training neural networks is usually the main drawback of artificial intelligence projects; whether because of their availability or their content [37], having a good dataset is essential for machine learning. When real world images are used to train GAN models, the availability of good datasets is not a problem: there is a large variety of datasets [36,86] that have been widely tested and are well known in the academic community.
Since the first GAN publication [51], GAN architectures have been used for synthesizing real world images. In the original work the models were used to generate images replicating the MNIST [91], CIFAR-10 [86] and Toronto Face Database (TFD) [163] datasets. The images generated using the original structure were very blurry and did not have good quality; nevertheless, they served to introduce the GAN architecture.
One of the first improvements over the original architecture was the DCGAN [142], which proposed structural changes and hyperparameter tuning with respect to the first proposed model. The results of the DCGAN showed improvements in the performance and generation of the networks: the generated images were clearer and more recognizable.
Despite that, the architecture still suffered from instability and mode collapse.
The WGAN architecture [5] drastically reduced the mode collapse and instability of the previous models. Thus, later models adopted the loss function of the WGAN along with their respective structural changes in the network. Table 2 summarizes the performance of the GAN models presented in this section. The compared datasets are MNIST [91], TFD [163], CIFAR-10 [86], CelebA-HQ [77] and Flickr-Faces-HQ (FFHQ) [79]. The metrics used for comparing the different variants are the accuracy of the models (the higher the better ↑), IS (the higher the better ↑) and FID (the lower the better ↓).

Image-to-image translation
Taking an image from one domain and converting it to another domain is known as image-to-image translation. It was first proposed with the Pix2Pix architecture [67]; Pix2Pix is based on CGAN, following the idea of generating images conditioned on their composition via a label input. With Pix2Pix the networks are capable of learning how the same image is translated between one domain and another.
The main drawback that Pix2Pix had was the requirement of having a paired dataset of images in both domains.
Following the steps of Pix2Pix, CycleGAN [211], DualGAN [196] and DiscoGAN [81] were developed. These new architectures were based on the cyclic consistency idea.
Cyclic consistency had previously been used in machine learning [162,75]; it is based on the idea that translating an image from one domain to another and then performing the reverse operation should recover the original image. Following this concept, the new networks were capable of translating images without a paired dataset, which increased considerably the number of possible applications of GANs to image-to-image translation.
Later on, the CSGAN was proposed [76], improving the results of previous architectures. The new proposed loss function achieved better results in image generation compared with CycleGAN [211], DualGAN [196], DiscoGAN [81] and PS2MAN [178].
Table 3 summarizes the performance of the presented GAN models in image-to-image translation tasks. The data is obtained from [76], where the SSIM (the higher the better ↑), MSE (the lower the better ↓), Peak Signal to Noise Ratio (PSNR) (the higher the better ↑) and Learned Perceptual Image Patch Similarity (LPIPS) [205,101] (the lower the better ↓) are computed for different GAN variants. The comparison is made on the CUHK [179] and FACADES [171] datasets. The LPIPS is a metric that measures the distance between the real and the generated distribution via perceptual similarity.

Video generation
GANs have proven to generate state-of-the-art results in image processing. Along with image generation comes the possibility of generating a sequence of images, i.e. a video. Video generation is a more complex task than image generation: the issues associated with image generation are also present in video generation, the computational cost of training models that can process video is high and, in addition, the synthesized videos must be temporally coherent.
One of the particular problems of video is the motion blur generated by the networks [57]. When a video is generated, the tracking of some objects can be difficult, generating fuzziness in some portions of the image. Some works have tried to tackle this problem [204,197,146], but it remains an open problem.
One of the most popular applications of video generation with GANs is what is known as deep fake. A deep fake consists in taking a video of a person and changing their face to be someone else's. Many works have been developed in this field over the last years [183].
Deep fake is one of the most controversial applications of GANs: the possibility of changing a face in a video makes it possible to generate fake videos that can be used to impersonate a person. This problem is magnified in the case of women [117] due to their position in society. Even though there are some applications of deep fakes that can be beneficial [89], their use still raises doubts in society. This is why much recent research has focused on how to detect deep fake videos [84,40,23,208].
Another application of GANs to video generation is video-to-video translation, which is indeed the general case of deep fake. Many architectures of this type have been proposed during the last years [27,9].
It should be noted that, in the case of video processing, the standard is to use previous information, such as another video, to generate the synthesized data. Unlike image generation, video generation is more interesting if the new information is conditioned by an external agent. In image processing, the only input was the latent space, and the final images were conditioned by the training dataset. When videos are generated, the degrees of freedom are extended, making the generated data less controlled. Controlling the video output is necessary to maintain the coherence of the final output, but it also eases the job of the GAN, which is significantly more difficult than in image processing.

Image generation from text
Since the introduction of CGAN, the capabilities of GANs were expanded. The possibility of constraining the synthesized information that GANs produce gives the networks a wider range of application: by controlling the output of the generations, the applications can be much more specific and interesting. One field where GANs have shown to outperform previous techniques is image generation from text [88].
Stacked GANs (StackGAN) [201] generate images from text descriptions in two stages: a first stage sketches a low resolution image from the text and a second stage refines it into a high resolution image. Table 4 summarizes the performance of the GAN models presented in this section. In addition to the mentioned networks, the Generative Adversarial Text to Image Synthesis (GAN-INT-CLS) [144] and the Generative Adversarial What-Where Network (GAWWN) [145] are included; both of these networks act as a reference of previous architectures. The compared metrics are HR (the lower the better ↓), IS (the higher the better ↑) and FID (the lower the better ↓). The compared datasets are Common Objects in Context (COCO) [102], Caltech-UCSD Birds (CUB) [176] and Oxford-102 [127].

Language generation
GAN models have been used during the last years in Natural Language Processing (NLP) tasks. The previously mentioned text-to-image field is one of the applications of GANs where natural language is involved, but there are also applications of GANs completely focused on producing new text.
Previous methods to process natural language used the so-called Long Short-Term Memory (LSTM) [64], which is capable of maintaining local relationships in the sequence. Approaches such as textGAN applied to language generation suffer from what is known as exposure bias. This bias is caused by the objective function of the network, which focuses on maximizing the log likelihood of the prediction. The exposure bias becomes visible in the inference stage, when G generates a sequence of words by iteratively predicting each word based on the previous ones; the problem arises when the prediction is based on sequences never seen during the training stage. Some works were made to tackle this problem [13], but the Sequence GAN (SeqGAN) [198] is the architecture that most improves the produced results [20]. The results of SeqGAN show a huge improvement in tasks such as language generation, poem composition and music generation. In addition, the performance of the models shows certain creativity in the synthesized data.
Despite the good results of GANs in NLP tasks during the last years, architectures have been developed that outperform GANs in language generation. The most successful architecture in this field is the Generative Pre-trained Transformer 3 (GPT-3) [44], which belongs to the GPT-n series. GPT-3 is a generator model based on the transformer [174] architecture, and the text it produces is often very difficult to distinguish from human writing. Due to the good results of transformers in NLP, the GAN approach to this field has been losing interest.

Data augmentation
Another field where GANs have proven to be really useful is data augmentation.
Due to the particularities of GANs, they can be used to obtain more samples from an original data distribution by replicating it. This way, by using GANs, the number of samples of a dataset can be multiplied.
Traditionally, data augmentation was achieved by transforming the initial data, e.g. cropping, rotating, shearing, or flipping images. One of the main drawbacks of these methods is that they transform the original data by slightly changing its structure. With GAN-based data augmentation, instead of changing the samples of the dataset, the new samples are synthesized from scratch by imitating the original data distribution. It should be noted that GAN-based data augmentation does not necessarily replace other augmentation methods; it proposes an alternative that, in many cases, can be used together with other data augmentation algorithms.
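As a point of comparison, the classical label-preserving transforms mentioned above can be sketched in a few lines; the particular transform choices and crop ratio below are illustrative assumptions:

```python
import numpy as np

def augment(image, rng):
    """Classical augmentation: random flip, 90-degree rotation and crop.
    These are the kinds of transforms that GAN-based augmentation
    complements rather than replaces."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                    # horizontal flip
    image = np.rot90(image, k=rng.integers(0, 4))   # random 90-degree rotation
    h, w = image.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)             # crop to 90% per side (assumed)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                       # toy RGB image
out = augment(img, rng)
print(out.shape)  # (28, 28, 3)
```

Unlike these pixel-level edits, a GAN draws entirely new samples from the learned distribution, so both strategies can be applied together.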
For example, the Data Augmentation Optimized for GAN (DAG) [170] proposes an enhanced data augmentation method for GANs, combining them with data transformations such as rotation, flipping or cropping. DAG is shown to improve the performance of data augmentation in GAN models, improving the FID of CGAN, Self-supervised GAN (SSGAN) and CycleGAN. The proposed architecture uses one D for each transformation of the data, but a single G.
Data augmentation with GANs has been used in cases where obtaining a dataset is difficult. For example, in medical applications there is usually not much information available; in these cases GANs can make the difference. This is why in recent years GANs have been used for medical data augmentation [45,82,139,59].

Other domains
As mentioned before, due to the particularities of GANs they can be applied to many different fields. One of the main strengths of machine learning is that it adapts to different situations without substantial changes in its structure. In particular, GANs can be adapted to any type of data distribution as long as a dataset is available.

GameGAN
One of the most interesting applications of GANs is the one presented with GameGAN [80].
The main purpose of GameGAN is to generate a video game entirely using machine learning. To do so, the complete Model-View-Controller (MVC) software design pattern is replicated using artificial intelligence. The proposed architecture is composed of three different modules.
The dynamics engine is in charge of the logic of the whole system, maintaining global coherence and updating the internal state of the game. For example, it controls which actions of the game are possible (e.g. eating a fruit in Pac-Man) and which ones are not (e.g. running through a wall in Pac-Man). The dynamics engine is composed of an LSTM that updates the state of the game at each frame; the LSTM gives the network a way to use the previous states of the game when computing the information of subsequent frames. This way, the network can access the complete history of the game, maintaining the consistency of the system.
To save the state of the game, a memory module is used. This module focuses on maintaining the long-term consistency of the game scene. When the game is being played, there are elements of the scene that are not always visible; with the memory module, these elements remain consistent over time. This memory remembers the generated static elements of the game. The memory module is implemented using a Neural Turing Machine (NTM) [52].
The third module that composes the system is the rendering engine, which is in charge of generating a visualization of the current state of the game. This module focuses on representing the different elements of the game realistically, producing disentangled scenes. The rendering engine is composed of transposed convolution layers that are initially trained using an autoencoder architecture to warm up the system and are then trained along with the rest of the modules.
The adversarial training of GameGAN uses three types of discriminators. The single-image discriminator evaluates the quality of each generated frame, judging how realistic it is. The action-conditioned discriminator determines whether two consecutive frames are consistent with respect to the input of the player. Finally, the temporal discriminator maintains the long-term consistency of the scene, preventing elements from appearing or disappearing randomly.
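The division of labour between the three discriminators can be sketched structurally as follows; the scoring functions are toy stand-ins of our own, not the actual networks of [80]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed shapes and scoring) for GameGAN's discriminators:
def single_image_d(frame):
    """Judges how realistic one generated frame looks."""
    return float(np.tanh(frame.mean()))

def action_conditioned_d(prev_frame, frame, action):
    """Judges whether a frame transition is consistent with the player input."""
    return float(np.tanh((frame - prev_frame).mean() + action))

def temporal_d(frames):
    """Judges long-term consistency across a whole generated clip."""
    return float(np.tanh(np.diff(frames, axis=0).mean()))

frames = rng.random((8, 16, 16))       # a short generated clip (toy data)
action = 1.0                           # the player's input for the last step
scores = {
    "single": single_image_d(frames[-1]),
    "action": action_conditioned_d(frames[-2], frames[-1], action),
    "temporal": temporal_d(frames),
}
g_loss = -sum(scores.values())         # G tries to raise every score
print(sorted(scores))
```

The point of the sketch is the interface: each discriminator sees a different slice of the generated rollout (one frame, one transition, the whole clip), and the generator is trained against all three at once.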
One of the foundations of GameGAN is the disentangling of dynamic and static elements of the game. The static elements of a game could be, for example, walls, while the dynamic elements are elements such as non-playable characters. By disentangling both types of elements, the game behavior is more interpretable for the model.

Medical imaging GANs
One of the most popular applications of the GAN architecture is enlarging datasets.
The objective of synthesizing new data is to produce larger datasets that improve the performance of machine learning models, which are very sensitive to the number of samples used in their training.
There are many fields where data augmentation can be applied, but in medical imaging augmenting data has certain benefits due to the particularities of the problem.
First, medical datasets are usually small because of the cost of obtaining the images; most of the time it is necessary to use measurement and recording machines such as radiography, magnetic resonance or ultrasound. In addition to the cost of obtaining these images, there also exist ethical and legal problems related to the nature of the data. Most of the time, obtaining images that expose the health status of different people is impossible, which leads to an even greater lack of available data.
It should be noted that one of the benefits of generating data with GANs is that the new samples do not belong to any real person.
Because of all these factors, there have been many GAN works related to the medical imaging field [56,123,172,153,85,188]. In addition, the work of Chen et al. [28] analyses the evolution of the field of medical data augmentation and suggests that research in this field remained strong in 2021, despite the fact that the number of published works has stayed level since 2019.

GANs in agriculture
Similar to the medical imaging field, obtaining images to train the computer vision models used in agricultural image analysis is not an easy task. These models benefit from large-scale balanced datasets, but the cost of obtaining high-quality labelled data makes data augmentation a crucial task for these datasets.
Many different GAN models have been applied to agricultural data, such as [95,191,69]. These works aim to generate new images of plants with different diseases, augmenting the number of samples by using GANs.
In these cases the use of GANs improves the results of the machine learning models by enlarging the amount of available data. Agricultural images have particularities that make their analysis a difficult task. For example, the biological variability between two samples of the same species makes it crucial to have many different samples in order to learn all the modes of the data. In particular, the same leaf of a fruit can drastically differ from one individual to another.
Another important factor is that labelling the data can be very costly, especially for specific applications such as disease detection in a certain plant, e.g. tomato leaves [95].
In addition, the environment where the images are taken, most of the time in crops, can lead to high variance in the images, such as lighting changes or object occlusion.

Drug discovery using GANs
The process of discovering and designing new drugs has recently been boosted by the field of Deep Learning [70,32]. In particular, GANs are a useful technique to synthesize new useful data samples. In the drug environment, the GAN architecture can process the drug compound using graphs or the Simplified Molecular Input Line Entry Specification (SMILES), to then generate synthetic drug samples.
Due to the flexibility that ANNs have in terms of operating with different data types, it is possible to use the same architectures in different fields. In this case, the overall GAN design can be adapted to molecular data, transferring the same principles of image generation to new data types.
The research conducted by Kadurin et al. [73,74] generates new drug compounds for anticancer therapy, using biological and chemical datasets. In particular, [74] uses an Adversarial Autoencoder that takes molecular fingerprints as inputs to the network. With this architecture the researchers are able to define the desired properties of the synthesized drugs. Some of the new synthetic drugs discovered by the Deep Learning architecture corresponded to previously known anticancer drugs.
This led the researchers to suggest that the remaining unknown drugs generated by the GAN could be studied further to determine their properties.
The work presented in [131] proposes the generation of new drugs by combining GANs with reinforcement learning techniques. In particular, the proposed G takes a random latent space as input and processes it with an RNN to produce a drug sequence in SMILES representation. The D, in turn, uses a one-dimensional CNN to distinguish the real data from the synthesized data. The results of the paper suggest that the new drugs discovered were unique and diverse. This may alleviate the first phases of drug development, which are very expensive in terms of time.
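The generator/discriminator split described above can be sketched as follows; the tiny vocabulary, the random sampler standing in for the RNN, and the single convolution kernel are all illustrative assumptions, not details taken from [131]:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list("CNO()=#1")   # tiny illustrative SMILES-like alphabet (assumption)

def sample_smiles(length, rng):
    """Toy stand-in for the RNN generator: emits a random character
    sequence over a SMILES-like vocabulary from a latent seed."""
    idx = rng.integers(0, len(VOCAB), size=length)
    return "".join(VOCAB[i] for i in idx)

def conv1d_score(seq, kernel):
    """Toy stand-in for the 1-D CNN discriminator: one-hot encode the
    sequence, slide one valid-mode 1-D convolution over it, and map the
    pooled response through a sigmoid to a 'real' probability."""
    onehot = np.zeros((len(seq), len(VOCAB)))
    for t, ch in enumerate(seq):
        onehot[t, VOCAB.index(ch)] = 1.0
    feat = np.array([(onehot[t:t + len(kernel)] * kernel).sum()
                     for t in range(len(seq) - len(kernel) + 1)])
    return 1 / (1 + np.exp(-feat.mean()))

kernel = rng.standard_normal((3, len(VOCAB)))   # one width-3 filter (assumption)
s = sample_smiles(12, rng)
print(s, round(conv1d_score(s, kernel), 3))
```

In the actual system the generator is additionally trained with a reinforcement-learning reward, so that sequences the discriminator accepts become more likely to be sampled again.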
The Federated Generative Adversarial Network for Graph-based Molecule Drug Discovery (FL-DISCO) architecture [115] aims to combine the generative potential of GANs with the graph-based molecule processing of Graph Neural Networks, while maintaining the privacy of the data using Federated Learning [83]. By using a graph representation of the molecules instead of SMILES, as in previous works, the represented samples have more realistic structures, maintaining the structural relationships of the connected atoms of the molecules. The Federated Learning framework is based on using different clients to train a specific neural network model; each client has its respective portion of the data, which it uses to train the network. This way, each client knows only a portion of the data and uses it to update the central model, while privacy is maintained because the clients cannot communicate with each other. The results of this research show progress in terms of the novelty and diversity of the synthesized drugs with respect to previous works.
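The server-side aggregation step of Federated Learning can be sketched with standard FedAvg-style weighted averaging; this rule is assumed for illustration, and [115] may aggregate client updates differently:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg-style aggregation: the server averages client model
    parameters weighted by each client's local dataset size. Clients
    never exchange raw data, only model parameters."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

# Three clients holding different amounts of (private) molecule data
clients = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 4.0])]
sizes = [100, 300, 100]
global_w = fed_avg(clients, sizes)
print(global_w)  # [2.  1.2]
```

Each round, the updated global parameters are broadcast back to the clients, which continue training on their local portions of the data.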

Discussion
Since their introduction in 2014, GANs have been the most important generative architecture in computer vision. The results provided by the developed GANs were notoriously better than those of previous architectures, such as Variational Autoencoders. This led to a constant improvement of the model, solving problems like stabilization or mode collapse.
With the introduction of Diffusion models [38,63,156], the results of GANs have been surpassed by these new models, which solve some of their most important problems.
Some aspects in which diffusion models outperform GANs are better stability, the absence of mode collapse and more diverse results. This is mainly because they are likelihood-based [30]. Despite their better results, diffusion models still have shortcomings in some aspects, such as the cost of synthesizing new samples, which makes them difficult to apply in real-time problems.
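The sampling cost mentioned above stems from the iterative nature of diffusion: generation must invert a long chain of noising steps. A minimal sketch of the forward (noising) chain, assuming a commonly used linear schedule, makes the step count explicit:

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Forward (noising) process of a diffusion model: the data is
    gradually corrupted with Gaussian noise over len(betas) steps.
    Generation has to invert every one of these steps, which is why
    sampling is costly compared with a single GAN forward pass."""
    x = x0.copy()
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)             # toy 16-dimensional "image"
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule, T = 1000
xT = forward_diffusion(x0, betas, rng)
print(xT.shape)
```

After the full chain the sample is close to pure Gaussian noise; a GAN, by contrast, maps noise to a sample in one pass, which explains the gap in inference cost.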
In [148], a diffusion model was developed to perform image-to-image translation. The results of this research show that the solution outperforms GANs without special attention to hyper-parameter tuning or any kind of sophisticated technique or loss function. Moreover, this research shows the great stability of the diffusion model architecture.
Despite the fact that Diffusion models are a novel architecture with few published works, they have great potential to surpass GAN results in the near future. At present, there are not enough results or applications of diffusion models in data generation, but the potential of this new architecture could lead to a significant improvement in the results of data synthesis. We consider that these models could replace GANs because of their stability and their lack of need for hyper-parameter fine-tuning.
Other new architectures, such as transformers, have been used to enhance the results of GANs. The transformer is a time-series-based architecture that adopts self-attention layers [174], making it possible to design larger models. Transformers have been used as the base neural model of the G and D of the GAN architecture, improving the performance of the model.
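The self-attention layer that such transformer-based GANs stack in place of convolutions can be sketched in a few lines; this is a minimal single-head version with assumed projection matrices, not the full multi-head block of [174]:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a sequence of tokens:
    every token attends to every other token, so the layer captures
    global relationships that a local convolution cannot."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))                 # 4 patch tokens, dim 8 (toy)
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(tokens, wq, wk, wv)
print(out.shape)  # (4, 8)
```

In an image GAN the tokens are typically flattened image patches, so each generated patch can depend on every other patch in the frame.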
TransGAN [68] presents a convolution-free GAN architecture that makes it possible to generate high-resolution images by using transformers in both the G and D of the GAN. The article shows improved results with respect to the IS and FID on the CIFAR-10 dataset [86].
Another work that showcases the interaction between GANs and transformers is the one presented in [112]. This work uses the generative model to predict pedestrian paths, exploiting the memory of the transformer architecture. In this sense, the GAN makes it possible to train the network to predict the future paths of pedestrians, while the transformer provides the memory to process a historical sequence of the latest movements.

Conclusion
This report summarizes the recent progress of GANs, going from the basic principles on which GANs are based to the most innovative architectures of recent years.
In addition, the different problems that GANs can suffer are categorized and the most common evaluation metrics are explained and discussed.
Regarding the recent progress in the field, a taxonomy for the GAN variants is proposed. The works are divided into two groups: one with the GANs that focus on architecture optimization and the other with the GANs that focus on objective function optimization. Despite being two separate groups of variants, it should be noted that the different works benefit from each other's progress. This ecosystem, in which there are various approaches to GAN development, is connected with the main problems reviewed in this survey, since each work normally focuses on trying to solve a certain problem of previous research.
Finally, the different applications of GANs in recent years are summarized.
The different applications of GANs are influenced by the development of the field and its impact on society and industry. We conclude with a comparison of the performance of the different architectures to provide a quantitative view of the evolution of GANs.