DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation

the best variety and quality. Both qualitative and quantitative results on benchmark data sets demonstrate the superiority of our DiverGAN in achieving diversity without harming quality and semantic consistency.


Introduction
The goal of text-to-image synthesis is to automatically yield perceptually plausible pictures from given textual descriptions. Recently, this topic has rapidly gained attention in the computer-vision and natural-language-processing communities due to its extensive range of potential real-world applications, including art creation, computer-aided design, data augmentation for training image classifiers, photo editing and the education of young children. Nevertheless, text-to-image generation is still an extremely challenging cross-modal task, since it not only requires a thorough semantic understanding of natural-language descriptions, but also a conversion of textual-context features into high-resolution images. The common paradigm involves a deep generative model implementing the cross-domain information fusion.
Thanks to the recent advances in the conditional generative adversarial network (CGAN) [2], current text-to-image approaches have made tremendous progress in image quality and semantic consistency when given natural-language descriptions as inputs. Existing text-to-image approaches can be roughly cast into two types. The first category adopts a multi-stage modular architecture [3-9] constructed with multiple generators and corresponding discriminators, producing visually realistic samples in a coarse-to-fine manner: the output of the initial-stage network is fed into the next-stage network to generate an image of higher resolution. While this text-to-image procedure has proven useful for the generation task, it requires training several networks and is thus time-consuming [1]. Even worse, the convergence and stability of the final generator network heavily rely on the earlier generators [10].
To address the issues mentioned above, a second category of text-to-image synthesis methods has been recently studied in [1,10], merely leveraging a generator/discriminator pair to produce photo-realistic images which semantically align with corresponding textual descriptions. Thanks to the power of deep generative neural networks, the feature fusion and generation process can be integrated into one single-stage procedure.
Although the methods in this second category achieve structural simplicity and superior performance, two significant issues remain. Firstly, these approaches suffer from the mode-collapse problem [11], in which the generator derails and synthesizes an inappropriate image for the text input. In less unfortunate cases of mode collapse, the CGAN generates a set of very similar output images conditioned on the single natural-language description at the input. As shown in Fig. 1, DF-GAN [1] fails to produce diverse samples, although noise is present at the input. This serious obstacle drastically degrades the diversity of generated images, limiting their applicability in practice. In data augmentation, for instance, robust classification can only be achieved if a wide range of shapes is generated. Secondly, existing single-stage approaches modulate the feature map by adopting only a global sentence vector, which lacks detailed and fine-grained information at the word level and prevents the model from manipulating parts of the image generation according to natural-language descriptions and qualifications [5,6].
One possible explanation for the lack-of-diversity issue is that the single-stage generator focuses more on the textual-context information that is high-dimensional and structured but ignores the random latent code which is responsible for variations [12]. Considering the fact that single-stage methods usually utilize modulation modules to reinforce the visual feature map for each scale in order to ensure image quality and semantic consistency, the conditional contexts are likely to provide stronger control over the output image than the latent code, and thus, the generator yields nearly identical instances from a single text description.
In this paper, we propose to develop an efficient and effective single-stage framework (DiverGAN) for yielding diverse and visually plausible instances that correspond well to the textual contexts. DiverGAN consists of four novel components discussed as follows.
Firstly, in order to stabilize the learning of the CGAN and boost the visual-semantic embedding in the visual feature map, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to normalize the feature map in the layer and channel while employing the detailed and fine-grained linguistic cues to scale and shift the normalized feature map. Moreover, CAdaILN can help with flexibly controlling the amount of change in shape and texture, allowing for deeper networks and complementing the following modulation modules. For example, a good generator should be able to respond to the requirements specified by textual qualifiers for size ('large') and/or color ('red').
Secondly, we design two new types of word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM). These modules capture the semantic affinities between word-context vectors and feature maps in the channels and in the (2D) spatial dimensions. CAM and PAM not only guide the model to focus more on the crucial channels and pixels that are semantically correlated with the prominent words (e.g., adjectives and nouns) in the given textual description, but also alleviate the impact of semantically irrelevant and redundant information. More importantly, with CAM and PAM, DiverGAN can effectively disentangle the attributes of the text description while accurately controlling the regions of synthetic images.
Thirdly, we present a dual-residual block constructed with two residual modules, each of which contains convolutional layers and CAdaILN with a ReLU activation function, followed by a modulation module. The dual-residual block not only benefits CGAN convergence by retaining more original visual features, but also efficiently improves network capacity, resulting in high-quality images with more details.
The text-to-image pipeline built upon the above three ingredients is capable of producing perceptually realistic images that semantically match the natural-language descriptions, but it still suffers from the lack-of-diversity problem.
Therefore, as the fourth remedy, we propose, on the basis of a variety of experiments, to use a linear layer that significantly boosts the generative ability of the network. This dense layer forces the generator to explore a wider range of modes of the original data distribution. In other words, plugging a linear layer into the single-stage architecture improves the control of the random latent code over the visual feature map, balancing the trade-off between a random latent code contributing to diversity and modulation modules that modulate the feature map based on word-context vectors. The experimental results will indicate that inserting a linear layer after the second dual-residual block of the architecture achieves the best performance in visual quality and image diversity. As illustrated in Fig. 1, our DiverGAN equipped with a fully-connected layer is capable of generating birds with different visual appearances of footholds, background colors, orientations and shapes on the CUB bird data set [13].
We perform comprehensive experiments on three benchmark data sets, i.e., Oxford-102 [14], CUB bird [13] and MS COCO [15]. Both quantitative and qualitative results demonstrate that the proposed DiverGAN synthesizes markedly better images than current single-stage and multi-stage models, including StackGAN++ [4], MSGAN [12], SDGAN [8], DMGAN [7], DF-GAN [1] and DTGAN [10].

Fig. 1. Diversity comparison between DF-GAN [1] and our DiverGAN given a single text description at the input. DF-GAN (left) tends to suppress the random latent code, synthesizing slight variations of a bird. Our DiverGAN (right) is an efficient and effective framework that avoids the lack-of-diversity issue and yields diverse, high-quality pictures.

The contributions of this work can be summarized as follows: We establish a novel single-stage architecture for the text-to-image synthesis task. Our framework mitigates the lack-of-diversity issue, producing diverse and high-resolution samples that are semantically correlated with textual descriptions.
CAdaILN is introduced to stabilize training as well as help modulation modules flexibly control the amount of change in shape and texture.
Two new types of word-level attention modules are designed to reinforce visual feature maps with word-context vectors.
Finally, as an essential step, we introduce a flattening fully-connected layer into the single-stage pipeline to deal with the lack-of-diversity issue in text-to-image generation.
The remainder of this paper is organized as follows. In Section 2, we review related works. Section 3 briefly introduces the basic theories of a GAN, a CGAN, mode collapse and dot-product attention. In Section 4, we describe the proposed DiverGAN in detail. Section 5 reports the experimental results and the paper is summarized in Section 6.

Related works
In this section, research fields correlated with our work are described, including overcoming mode collapse in the GAN and the CGAN, CGAN-based text-to-image synthesis and attention mechanisms.

Avoiding mode collapse
Mode collapse is a common and serious obstacle in the GAN. There are two major directions for mitigating the lack-of-diversity and mode-collapse issues in a generative adversarial network (GAN) [16]. Some publications suggest adapting the objective strategy for optimizing the discriminator and the training process, e.g., by employing a minibatch discriminator [17], spectral regularization [18] or unrolled optimization [19]. Metz et al. [19] proposed Unrolled GAN, introducing a new objective that updates the generator through an unrolled optimization of the discriminator. Salimans et al. [17] developed a minibatch discriminator that checks multiple output samples in a minibatch, together with a novel generator objective, to tackle overtraining of the discriminator. Liu et al. [18] recently proposed spectral regularization for handling mode collapse, suggesting that the optimization of the discriminator is associated with the spectral distributions of the weight matrix.
Other papers suggest adding auxiliary structures to a GAN to encourage the generator to explore more modes of the true data distribution, including an autoencoder [20-23], multiple generators [24,25] and a conditional augmentation technique [3]. Che et al. [20] presented a mode-regularized GAN (ModeGAN) incorporating the autoencoder into the standard GAN to avoid the mode-missing problem. Motivated by ModeGAN, VEEGAN [26] built a reconstruction net to map the true data distribution to the latent codes so that the generator network is able to synthesize all the data modes. EBGAN [21], BEGAN [22] and VAEGAN [23] also made efforts to combine the GAN with the autoencoder. To combat the mode-collapse issue, Ghosh et al. [24] provided a multi-agent GAN architecture (MAD-GAN), where multiple generators are employed to capture different modes and one discriminator is designed to identify generated samples.
There are also several works for mitigating the lack-of-diversity problem in the CGAN. Mao et al. [12] presented a regularization term to encourage the generator to explore more minor modes and force the discriminator to concentrate on the instances from the minor modes. To produce diverse instances from the textual description, Zhang et al. [3] developed a conditional augmentation technique (CA) concatenating a latent code and a textual variable sampled from the conditional Gaussian distribution as the inputs of the generator.
Nonetheless, these approaches either incur an exorbitant computational cost or are ineffective in a single-stage architecture. Here, we take the multi-generator model [24] as an example. Assuming that several single-stage generators are exploited to learn all modes of the true data distribution to overcome mode collapse, we would expect different generators to fit diverse modes. In practice, however, the generators are still prone to falling into a single mode for text-to-image synthesis, since each generator only pays attention to the same conditional context. We propose to distinguish three levels of performance for a GAN:

Level 1: Traditional mode collapse [17,19,11,27,26,20,21] - strange, inappropriate patterns become a point attractor in the non-linear cyclic process [28] between the generator and the discriminator (cf. Fig. 6 in [17]);

Level 2: Light mode collapse [12,29,3,30,31] - patterns from the training set become wholly or partially an attractor for the generator. This is akin to lookup-table behavior made possible by the high number of parameters in a GAN. This constitutes the lack-of-diversity problem (cf. Fig. 1, left, this paper);

Level 3: Desired functionality - generation of diverse patterns that are semantically consistent with the text probe as regards the foreground and that provide a believable, natural pattern in the background.

This paper focuses on solving the lack-of-diversity issue, i.e., light mode collapse, not the severe, traditional mode collapse.

CGAN in text-to-image generation
Thanks to critical improvements in generative approaches, especially the GAN and the conditional GAN (CGAN), inspiring advances in text-to-image generation have become possible. According to the number of generators and discriminators these methods exploit, we roughly group them into two categories.
Multi-stage. Zhang et al. [3,4] presented StackGAN and StackGAN++ to synthesize compelling instances from textual descriptions in a coarse-to-fine way. Qiao et al. [9] proposed MirrorGAN, which exploits an image-captioning model to regenerate the text description from a fake sample in order to boost the semantic relevance between textual contexts and visual contents. Zhu et al. [7] developed DMGAN, which applies a dynamic memory module to improve the quality of initial images. Yin et al. [8] designed a Siamese structure and conditional Batch Normalization to achieve semantic consistency of synthetic images. SegAttnGAN [32] employs additional segmentation information for the image-refinement process, so that the network is capable of yielding images of realistic quality.
Single-stage. Reed et al. [33] were the first to use a single-stage CGAN to generate images from detailed text descriptions. However, the quality of the synthetic images was limited by the simple structure and immature training techniques of the CGAN. Tao et al. [1] presented DF-GAN, adopting a matching-aware zero-centered gradient penalty (MA-GP) loss to cope with the problems of the multi-stage architecture. Nonetheless, DF-GAN just uses a fully-connected layer to merge the feature map and a sentence vector, lacking an efficient modulation mechanism. Zhang et al. [10] proposed DTGAN, which introduced two novel attention modules and conditional normalization to generate high-resolution and semantically consistent images. Despite its excellent performance, DTGAN still suffers from the lack-of-diversity issue, which will be shown in Section 5.3.4.

Attention mechanism
One vital property of our human visual system is that humans are capable of focusing more on the salient parts of an image and ignoring unimportant regions. Inspired by this, attention mechanisms are invented to guide the network to concentrate on the most discriminative local features and filter out irrelevant information. Thanks to their advantages over traditional methods with respect to feature processing and feature selection, attention mechanisms have been extensively explored in a series of computer-vision fields, such as automatic segmentation [34,35], image fusion [36], object tracking [37], video deblurring [38], scene classification [39], image super-resolution [40] and action recognition [41].
Attention mechanisms have also found their way into text-to-image synthesis, since they play an essential role in bridging the semantic gap between vision and language. On the one hand, Xu et al. [5] utilized a spatial-attention mechanism to derive the relationship between the image subregions and the words in a sentence, so that the subregions most relevant to the words receive particular focus. On the other hand, Li et al. [6] designed a channel-wise attention mechanism on the basis of Xu et al. [5]. However, the aforementioned works adopt the weighted sum of converted word features as the new feature map, which differs largely from the original feature map. Moreover, they both treat all words equally, although words play different roles in generating samples.
In contrast, we aim to modulate the per-scale feature map with word features across both the channel and spatial dimensions, while also retaining the basic features to some extent in order to improve image quality and stabilize the learning of the CGAN. In the meantime, we model the importance of each word to emphasize the salient words (e.g., adjectives and nouns) in the given text description. Experiments conducted on three benchmark data sets validate the effectiveness of our proposed attention modules in comparison with prior methods.

Generative adversarial network (GAN)
A GAN comprises two networks: the generator G and the discriminator D, which can be seen as playing a minmax zero-sum game. Concretely, the aim of G is to capture the distribution of the real data while yielding plausible images to trick D, whereas D is optimized to classify a sample as real or fake. Both G and D are typically implemented as deep neural networks. Mathematically, the minmax objective V(G, D) of the GAN can be written as

min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 - D(G(z)))],

where p_data(x) and p_z(z) represent the distributions of the true data x and the random latent code z, respectively.
G tries to minimize the objective V during this minmax two-player game, while D aims to maximize it. The game ends when the distribution of the produced samples entirely overlaps p_data(x).

Conditional generative adversarial network (CGAN)
As an extension of the GAN, a CGAN takes a conditional context c (e.g., class labels, text descriptions or low-resolution images) and the random code z as the inputs of G, and outputs samples that are correlated with c. Meanwhile, D has to distinguish the real pair (x, c) from the fake pair (G(z, c), c). Concretely, the objective V(G, D) of the CGAN is formulated as

min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x, c)] + E_{z∼p_z(z), c∼p_c(c)}[log(1 - D(G(z, c), c))],

where p_c(c) denotes the distribution of c. In the task of text-to-image synthesis, c is the given textual description. Let {(I_i, C_i)} (i = 1, ..., n) denote a set of n image-text pairs for training, where I_i represents an image and C_i indicates a suite of k natural-language descriptions. The goal of G is to produce a visually plausible and semantically consistent sample Î_i according to a text description c_i randomly picked from C_i.

Mode collapse
Mode collapse is a phenomenon in which the model fails to learn all the modes of the true data distribution, so that the generated samples lack diversity [11]. In addition, for the CGAN, the instances produced from a single conditional context look identical [12]. One of the main reasons for the lack-of-diversity issue is that the generator merely visits a part of the real data distribution and misses a few modes, limited by the generative capability of the network [42]. Although the generator tries to map random latent codes into the original data distribution, such a map is not surjective [43]. When the designed model is not powerful enough, the network can only capture a few modes of the real data distribution and tends to map different inputs onto these modes.
The lack-of-diversity issue becomes worse in the single-stage text-to-image pipeline. Generally, in the single-stage framework, a random latent code is taken as the input which is responsible for variations, while textual features serve as side information to modulate the visual feature map and determine the main visual contents. To boost the semantic consistency of generated instances, the generator usually leverages modulation modules for the per-scale feature map, which may make the network concentrate on the textual-context features that are high-dimensional and structured, and ignore the low-dimensional latent code. In this work, we aim to enhance the generative ability of the network to reinforce the control of a random latent code over the visual feature map for the purpose of improving diversity.

Scaled dot-product attention
Generally, an attention function receives a query q_i ∈ R^d and a series of key-value pairs as inputs, and outputs a weighted sum of the values, where the attention weight on each value is calculated via a softmax function on the dot products of the query with all keys. The process of generating the attention weights is formulated as

W = Softmax(D(q_i, K)),

where D(·) denotes the dot-product operation, K denotes the matrix of keys and Softmax represents the softmax function.
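For concreteness, the following minimal PyTorch sketch implements this weighting scheme; the tensor shapes and the scaling by the square root of d are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, keys, values):
    """query: (d,), keys: (T, d), values: (T, d).
    A softmax over the query-key dot products weights the sum of values."""
    d = query.size(0)
    scores = keys @ query / d ** 0.5          # (T,) scaled dot products
    weights = F.softmax(scores, dim=0)        # attention weight per value
    return weights @ values                   # (d,) weighted sum of the values

# e.g. out = scaled_dot_product_attention(torch.randn(64), torch.randn(10, 64), torch.randn(10, 64))
```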

The proposed approach
In this section, we discuss the overall architecture of DiverGAN, depicted in Fig. 2. After that, two novel types of attention modules, a channel-attention module (CAM) and a pixel-attention module (PAM), are introduced to modulate the visual feature map with word-context embeddings. Subsequently, we describe the proposed Conditional Adaptive Instance-Layer Normalization (CAdaILN), which leverages the linguistic cues derived from the sentence vector to flexibly manipulate the amount of change in shape and texture.

Fig. 2 shows the overall architecture of DiverGAN. The first layer exploits a fully-connected layer to process a latent code z ∈ R^{100} and reshapes the result into the initial feature map F_0 ∈ R^{4×4×256}. After that, F_0 is fed into the basic generator module (see Fig. 2(b)), which mainly comprises seven dual-residual blocks (see Fig. 2(c)) receiving textual embeddings derived from a text encoder as extra conditional contexts to strengthen the visual feature map. Fig. 2(c) shows the details of our dual-residual module. The designed dual-residual block consists of two residual modules (see Fig. 2(d)), each of which is constructed from a set of convolutional layers, CAdaILN and ReLU activation functions, followed by modulation modules (i.e., a channel-attention module (CAM) and a pixel-attention module (PAM)) that take the word-context features as side information to modulate the visual feature map. Furthermore, we plug a stack of a convolutional layer and CAdaILN activated by ReLU between the two residual modules to relax the control of the contextual information. The dual-residual block not only accelerates convergence by retaining more original visual features than cascaded convolutions, but also allows us to easily increase the depth of the network, efficiently improving network capacity and resulting in high-quality images with more details.
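The following PyTorch sketch outlines this block structure under simplifying assumptions: the two conditioned residual modules are passed in as black boxes (standing in for the conv/CAdaILN/ReLU/CAM/PAM stacks described above), and nn.Identity defaults keep the sketch runnable on its own.

```python
import torch
import torch.nn as nn

class DualResidualBlock(nn.Module):
    """Sketch of a dual-residual block: two residual modules with a
    conv + normalization + ReLU bridge in between. `res1` and `res2`
    stand in for the text-conditioned residual modules of the paper."""
    def __init__(self, channels, res1=None, res2=None):
        super().__init__()
        self.res1 = res1 if res1 is not None else nn.Identity()
        self.bridge = nn.Sequential(                 # conv + (CAdaILN) + ReLU
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU())
        self.res2 = res2 if res2 is not None else nn.Identity()

    def forward(self, x):
        h = x + self.res1(x)      # first residual module keeps original features
        h = self.bridge(h)        # relaxes the control of the contextual information
        return h + self.res2(h)   # second residual module

# e.g. y = DualResidualBlock(256)(torch.randn(1, 256, 8, 8))
```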

Overall architecture
To deal with the lack-of-diversity issue, we conducted a series of experiments on structure design, trying to reinforce the control of the random latent code over the visual feature map to enhance variation. We observe that a dense layer can significantly boost the generative ability of the network, encouraging the model to explore minor modes of the true data distribution. For this reason, we insert one linear layer into our pipeline. The experimental results demonstrate that embedding a fully-connected layer after the second dual-residual block achieves appealing performance in visual quality and image diversity. More specifically, the output feature map F_2 ∈ R^{8×8×256} of the second dual-residual block is first reshaped to R^{16384} and put through a linear layer that maintains the dimension of the input features. Afterwards, F_2 is reshaped to R^{8×8×256} again and passed to the next dual-residual block. Finally, the output of the basic generator module (see Fig. 2(b)) is sent to one convolutional layer activated by a tanh function to generate the final sample.
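In code, this insertion is a flatten / dense / reshape step; a minimal sketch follows (PyTorch channel-first layout, so the paper's 8×8×256 map appears as (B, 256, 8, 8)):

```python
import torch
import torch.nn as nn

class LinearBottleneck(nn.Module):
    """Flatten the output of the second dual-residual block, pass it
    through a dimension-preserving dense layer and reshape it back,
    forcing all information through a 1-D embedding."""
    def __init__(self, c=256, h=8, w=8):
        super().__init__()
        self.shape = (c, h, w)
        self.fc = nn.Linear(c * h * w, c * h * w)    # 16384 -> 16384

    def forward(self, f2):                           # f2: (B, 256, 8, 8)
        flat = f2.flatten(start_dim=1)               # (B, 16384)
        return self.fc(flat).view(-1, *self.shape)   # back to (B, 256, 8, 8)

# e.g. out = LinearBottleneck()(torch.randn(16, 256, 8, 8))
```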
Why does inserting a dense layer address the lack-of-diversity issue? With a fully-connected layer, the network cannot exploit the spatial 2D layout of the preceding feature maps and needs to encode all the necessary information in a single 1D vector (embedding) as the basis for an unfolding into 2D by the later layers. As a result, the representation at this point in the architecture lends itself to the injection of random noise, with a subsequently increased diversity in the generated patterns: because of the 1D bottleneck, the network cannot easily replicate (partial) 2D patterns from the early feature maps. This avoids the 'lookup-table' property that many GAN architectures have.
Why does embedding a fully-connected layer after the second residual block achieve the best variety and quality? If the dense layer is too early in the network, it obtains early, crude featural information that is insufficient to generate semantically consistent output patterns. If you have an early feature map representing small bird components, the network will not be able to assemble a bird. On the other hand, if the dense layer is too late in the network, it will be fed by almost-complete patterns, and there will be a lack of diversity, since the system will operate as a lookup-table for the patterns in the training set that can then only be modified marginally by the last few layers. Additionally, it is extremely difficult to insert a dense layer after the third or later residual blocks due to the limited memory on GPUs. This can only be determined empirically.
To the best of our knowledge, we are the first to propose this kind of text-to-image architecture introducing one linear layer to improve the power of a random latent code to produce images of diverse modes. We expect that DiverGAN can provide a strong basis for the future developments of text-to-image generation.

Dual attention mechanism
It is well known that the semantic affinities between conditional contexts and visual feature maps are particularly critical for image synthesis. However, this correlation becomes more complicated for text-to-image generation, since the given sentence contains a suite of words with different contributions to the synthetic sample. For instance, an adjective in the input text description will contribute more to the produced sample than the definite article "the". Moreover, although our dual-residual structure is beneficial for model capacity and training stability, it may introduce noise and redundant information. For these reasons, two new types of word-level attention modules, termed a channel-attention module (CAM) and a pixel-attention module (PAM), are designed to explore the latent interplay between word-context features and visual feature maps. CAM and PAM have the capacity to identify the significant words (e.g., adjectives and nouns) in the given text description and make the network assign larger weights to the crucial channels and pixels semantically associated with these words. In addition, they alleviate the effects of semantically irrelevant and redundant features from both the channel and position perspectives.

Channel-attention module (CAM)
Channel maps respond differently to the words in the given sentence; thus, the channels that respond to the prominent words deserve more attention from the network. Here, we propose CAM to model the importance of each word while assigning larger weights to the channels that semantically match the salient words. Fig. 3 illustrates the detailed structure of CAM.

Given a feature map F_c ∈ R^{H×W×C} (where H, W and C denote the height, the width and the number of channels of F_c, respectively), we first apply global average pooling and global max pooling to aggregate holistic and discriminative information, deriving two channel feature vectors F_ca ∈ R^{1×1×C} and F_cm ∈ R^{1×1×C}. After that, F_ca and F_cm are fed into two different 1×1 convolution layers rather than one shared 1×1 convolution, since average pooling and max pooling acquire different global spatial statistics [44]. The outputs are then reshaped to R^{1×C} and repeated C times along dimension 1 to obtain two queries: the average-pooling query F_caq ∈ R^{C×C} and the max-pooling query F_cmq ∈ R^{C×C}. Mathematically,

F_caq = f_re(fc_aq(Avg(F_c))),  F_cmq = f_re(fc_mq(Max(F_c))),

where Avg and Max represent global average pooling and max pooling, respectively, fc_aq and fc_mq denote 1×1 convolution layers, and f_re refers to the reshape and repeat operations.

For the word-context vectors E ∈ R^{M×T} (where M denotes the dimension of the word embeddings and T the number of words in the given text description), we pass them through two different 1×1 convolutions followed by ReLU activation to produce two contextual matrices that lie in the common semantic space of the visual features: the key F_ck ∈ R^{C×T} and the value F_cv ∈ R^{C×T}. Next, we compute the mean of the value F_cv along dimension 2 and reshape it into R^{1×C}. We then multiply this result with the value F_cv to obtain the contextual attention map E_ci ∈ R^{1×T}, which indicates the importance of each word in the sentence. Intuitively, a larger value in the attention map means that the corresponding word contributes more to the synthetic image. The contextual attention map is obtained as

F_ck = fc_ck(E),  F_cv = fc_cv(E),  E_ci = fc_mean(F_cv) F_cv,

where fc_ck and fc_cv refer to 1×1 convolution layers followed by ReLU activation, and fc_mean represents the average and reshape operations.

Afterwards, to model the semantic affinities between word-context features and channels, we perform a dot product between the queries and the key F_ck ∈ R^{C×T} and apply a softmax function to obtain the contextual channel-wise attention matrix CA ∈ R^{C×T}, which indicates the similarity weights between the channels and the words in the sentence. Finally, the channel-attention weights W_c are calculated through a softmax function on the dot products of CA with the transpose of E_ci, and reshaped into R^{1×1×C}. This series of operations is formulated as

CA_a = Softmax(D(F_caq, F_ck)),  CA_m = Softmax(D(F_cmq, F_ck)),
W_ca = f_re(Softmax(D(CA_a, E_ci^T))),  W_cm = f_re(Softmax(D(CA_m, E_ci^T))),

where D(·) denotes the dot-product operation and Softmax represents the softmax function. CA_a ∈ R^{C×T} and CA_m ∈ R^{C×T} refer to the contextual channel-wise attention matrices of F_caq and F_cmq, respectively. f_re indicates the reshape operation. W_ca ∈ R^{1×1×C} and W_cm ∈ R^{1×1×C} are the channel-attention weights derived from CA_a and CA_m, respectively.

After acquiring the channel attention scores, we multiply them with the original feature map to re-weight it, so that the network focuses more on the useful channels of the feature map and assigns larger weights to them. At the same time, we design an adaptive gating method to dynamically merge the output feature maps of the global average-pooling and max-pooling branches. The fused feature map F_cw ∈ R^{H×W×C} is obtained as

F_caw = W_ca ⊙ F_c,  F_cmw = W_cm ⊙ F_c,
g_ca = σ(fc_ga(f_re(F_ca)) + fc_gm(f_re(F_cm))),
F_cw = g_ca F_caw + (1 - g_ca) F_cmw,

where ⊙ is the element-wise multiplication, F_caw and F_cmw denote the rescaled feature maps, g_ca represents the response gate for the fusion of the visual feature maps, f_re refers to the reshape operation that resizes F_ca and F_cm into R^{1×C}, fc_ga and fc_gm represent fully-connected layers that reduce the number of channels to 1, and σ denotes the sigmoid function.

To retain the basic features and stabilize the learning of the CGAN, we further apply an adaptive residual connection [45] to synthesize the final result F_cu ∈ R^{H×W×C}:

F_cu = c_c F_cw + F_c,

where c_c is a learnable parameter initialized as 0.

Fig. 3. Overview of the proposed channel-attention module, which assigns larger weights to the channels that semantically match the salient words. AvgPool and MaxPool refer to global average pooling and max pooling, respectively. H, W and C denote the height, the width and the number of channels of the visual feature map. M is the dimension of the word embeddings and T is the number of words in the given text description. 1×1 conv indicates the 1×1 convolution operation.
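As a rough, single-branch sketch (average pooling only; the max-pooling branch and the gated fusion are analogous), CAM can be approximated in PyTorch as follows. The per-channel scoring below simplifies the repeated-query matrix product to an element-wise product, so this is an interpretation of the equations, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMBranch(nn.Module):
    """Simplified average-pooling branch of CAM.
    C = channels, M = word-embedding dim, T = number of words."""
    def __init__(self, c, m):
        super().__init__()
        self.fq = nn.Conv1d(c, c, 1)    # 1x1 conv on the pooled channel vector
        self.fk = nn.Conv1d(m, c, 1)    # word features -> key
        self.fv = nn.Conv1d(m, c, 1)    # word features -> value
        self.gamma = nn.Parameter(torch.zeros(1))   # adaptive residual weight

    def forward(self, feat, words):     # feat: (B,C,H,W), words: (B,M,T)
        q = self.fq(feat.mean(dim=(2, 3)).unsqueeze(-1)).squeeze(-1)   # (B,C)
        k = F.relu(self.fk(words))                                     # (B,C,T)
        v = F.relu(self.fv(words))                                     # (B,C,T)
        e_ci = torch.einsum('bc,bct->bt', v.mean(-1), v)   # word importance (B,T)
        ca = F.softmax(q.unsqueeze(-1) * k, dim=-1)        # channel-word map (B,C,T)
        w_c = F.softmax(torch.einsum('bct,bt->bc', ca, e_ci), dim=-1)  # (B,C)
        out = feat * w_c.unsqueeze(-1).unsqueeze(-1)       # re-weight the channels
        return self.gamma * out + feat                     # adaptive residual

# e.g. y = CAMBranch(256, 300)(torch.randn(2, 256, 16, 16), torch.randn(2, 300, 18))
```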

Pixel-attention module (PAM)
As discussed above, CAM re-weights the visual feature map from the channel perspective. However, in addition to the channels of the image feature map, the visual pixels are of central importance for quality and semantic consistency. Hence, PAM is presented to effectively capture the spatial interplay between visual pixels and word-context vectors, allowing the significant pixels to receive more attention from the network. Notably, PAM is performed on the output of CAM, since the visual pixels in the same channel of the feature map still share the same weights.
The framework of PAM is depicted in Fig. 4. For a feature map F_s ∈ R^{H×W×C}, we first perform average pooling and max pooling in the spatial dimension to distill global features, acquiring two spatial feature maps F_sa ∈ R^{H×W×1} and F_sm ∈ R^{H×W×1}. Then, F_sa and F_sm are reshaped into R^{(H×W)×1} and repeated C times along dimension 2 to obtain two matrices: the average-pooling query F_saq ∈ R^{(H×W)×C} and the max-pooling query F_smq ∈ R^{(H×W)×C}. The queries are obtained as

F_saq = f_re(Avg(F_s)),  F_smq = f_re(Max(F_s)),

where Avg and Max denote the average pooling and max pooling in the spatial dimension, respectively, and f_re refers to the reshape and repeat operations.

Given the word-context features E ∈ R^{M×T}, we process them in the same way as in CAM, producing the key F_sk ∈ R^{C×T}, the value F_sv ∈ R^{C×T} and the contextual attention map E_si ∈ R^{1×T}. Specifically,

F_sk = fc_sk(E),  F_sv = fc_sv(E),  E_si = fc_mean(F_sv) F_sv,

where fc_sk and fc_sv refer to 1×1 convolution layers followed by ReLU activation, and fc_mean represents the average and reshape operations.

After that, the spatial-semantic attention map SA ∈ R^{(H×W)×T} is obtained via a softmax function on the dot products of the queries with the key F_sk ∈ R^{C×T}, indicating the similarity weights between the visual pixels and the words in the sentence. Subsequently, we perform a dot product between SA and the transpose of E_si, and apply a softmax function to obtain the pixel-wise attention weights, which are reshaped to R^{H×W×1}. The pixel-wise attention weights W_s ∈ R^{H×W×1} are computed as

SA_a = Softmax(D(F_saq, F_sk)),  SA_m = Softmax(D(F_smq, F_sk)),
W_sa = f_re(Softmax(D(SA_a, E_si^T))),  W_sm = f_re(Softmax(D(SA_m, E_si^T))),

where D(·) denotes the dot-product operation and Softmax represents the softmax function. SA_a ∈ R^{(H×W)×T} and SA_m ∈ R^{(H×W)×T} refer to the contextual spatial-attention maps of F_saq and F_smq, respectively. W_sa ∈ R^{H×W×1} and W_sm ∈ R^{H×W×1} are the pixel-wise attention weights derived from SA_a and SA_m, respectively.

Next, as in CAM, we multiply the pixel-wise attention scores with the original feature map to enhance the visual feature map. Meanwhile, in order to preserve the features, we concatenate the rescaled feature maps of the average-pooling and max-pooling branches and pass the result through a 1×1 convolution layer followed by a ReLU function to generate the merged feature map F_sw ∈ R^{H×W×C}. Then, an adaptive residual connection is adopted to obtain the final output F_su ∈ R^{H×W×C}:

F_saw = W_sa ⊙ F_s,  F_smw = W_sm ⊙ F_s,
F_sw = fc_con([F_saw; F_smw]),
F_su = c_s F_sw + F_s,

where ⊙ is the element-wise multiplication, F_saw and F_smw denote the rescaled feature maps, [·;·] refers to the concatenation along the channel dimension, fc_con represents the 1×1 convolution followed by a ReLU function, and c_s is a learnable parameter initialized as 0.

Fig. 4. Overview of the proposed pixel-attention module, which captures the spatial relationships between pixels and word embeddings and allows significant pixels to acquire more weight. AvgPool and MaxPool denote the average pooling and max pooling in the spatial dimension, respectively. H, W and C denote the height, the width and the number of channels of the visual feature map. M is the dimension of the word embeddings and T is the number of words in the given text description. 1×1 conv indicates the 1×1 convolution operation.
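A matching single-branch sketch of PAM (again average pooling only, with the same simplification of the repeated-query product as in the CAM sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAMBranch(nn.Module):
    """Simplified average-pooling branch of PAM: one attention
    weight per spatial position, derived from the word context."""
    def __init__(self, c, m):
        super().__init__()
        self.fk = nn.Conv1d(m, c, 1)    # word features -> key
        self.fv = nn.Conv1d(m, c, 1)    # word features -> value
        self.gamma = nn.Parameter(torch.zeros(1))   # adaptive residual weight

    def forward(self, feat, words):                 # feat: (B,C,H,W), words: (B,M,T)
        b, c, h, w = feat.shape
        q = feat.mean(dim=1).view(b, h * w, 1)      # per-pixel query (B,HW,1)
        k = F.relu(self.fk(words))                  # (B,C,T)
        v = F.relu(self.fv(words))                  # (B,C,T)
        e_si = torch.einsum('bc,bct->bt', v.mean(-1), v)          # word importance (B,T)
        sa = F.softmax(q * k.sum(dim=1, keepdim=True), dim=-1)    # pixel-word map (B,HW,T)
        w_s = F.softmax(torch.einsum('bpt,bt->bp', sa, e_si), dim=-1)  # (B,HW)
        out = feat * w_s.view(b, 1, h, w)           # re-weight the pixels
        return self.gamma * out + feat              # adaptive residual
```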
Why do our attention modules work better? Firstly, we model the importance of each word to emphasize the salient words (e.g., adjectives and nouns) in the given text description. This makes our generator concentrate more on the semantically related parts of the image. Secondly, in order to achieve better quality, we utilize average pooling and max pooling to acquire different global spatial and channel information. Thirdly, an adaptive gating method is designed to dynamically merge the output feature maps of average pooling and max pooling, obtaining better performance. Fourthly, we specifically spread our attention weights to all the channels and pixels to enhance the feature maps in several (!) layers, while applying an adaptive residual connection to synthesize the final result, in order to retain the basic features and stabilize the learning of the CGAN. We have not seen this in the literature. Fifthly, different from previous methods, we leverage our attention modules to modulate the per-scale feature map across both the channel and spatial dimensions. Sixthly, our proposed CAdaILN helps to flexibly control the amount of change in shape and texture, complementing our attention modules.

Conditional Adaptive Instance-Layer Normalization (CAdaILN)
In order to stabilize the training of the GAN [16], most existing text-to-image generation models [5-9] employ Batch Normalization (BN) [46], which applies the normalization to a whole batch of generated images instead of to single ones. However, the convergence of BN heavily depends on the batch size [47]. Furthermore, the advantage of BN is not obvious for text-to-image generation, since each synthetic image is more pertinent to the given text description and to the feature map itself. To this end, CAdaILN is designed to perform the normalization over the layer and channel dimensions of the feature map f and to modulate the normalized feature map f̂ with the linguistic cues captured from the global sentence vector s, as illustrated in Fig. 5. More concretely, we employ two fully-connected layers W_1 and W_2 to transform the sentence vector s into the linguistic cues γ ∈ R^{1×1×C} and β ∈ R^{1×1×C}. Moreover, we normalize the visual feature map with Instance Normalization (IN) and Layer Normalization (LN). The normalized feature map f̂ is acquired as the adaptive sum of the IN output â_I and the LN output â_L. Afterwards, we leverage γ and β to scale and shift f̂. The process of CAdaILN is formulated as

γ = W_1 s,  β = W_2 s,
f̂ = ρ ⊙ â_I + (1 - ρ) ⊙ â_L,
CAdaILN(f, s) = γ ⊙ f̂ + β,

where the ratio of IN to LN depends on a learnable parameter ρ ∈ R^{1×1×C}, whose value is constrained to the range [0, 1]. Moreover, ρ is updated together with the generator parameters.
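A compact PyTorch sketch of CAdaILN under the formulation above; the clamp on ρ is one simple way to keep it in [0, 1], and the sentence-vector dimension is an assumption:

```python
import torch
import torch.nn as nn

class CAdaILN(nn.Module):
    """Sketch of Conditional Adaptive Instance-Layer Normalization:
    blend instance- and layer-normalized activations with a learnable
    per-channel ratio rho, then scale and shift the result by gamma
    and beta projected from the sentence vector."""
    def __init__(self, channels, sent_dim, eps=1e-5):
        super().__init__()
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))
        self.to_gamma = nn.Linear(sent_dim, channels)   # W_1
        self.to_beta = nn.Linear(sent_dim, channels)    # W_2
        self.eps = eps

    def forward(self, x, sent):                         # x: (B,C,H,W), sent: (B,S)
        mu_i = x.mean(dim=(2, 3), keepdim=True)         # instance statistics
        var_i = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        mu_l = x.mean(dim=(1, 2, 3), keepdim=True)      # layer statistics
        var_l = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_in = (x - mu_i) / torch.sqrt(var_i + self.eps)
        x_ln = (x - mu_l) / torch.sqrt(var_l + self.eps)
        rho = self.rho.clamp(0, 1)                      # keep the ratio in [0, 1]
        x_hat = rho * x_in + (1 - rho) * x_ln           # adaptive IN/LN sum
        gamma = self.to_gamma(sent).view(-1, x.size(1), 1, 1)
        beta = self.to_beta(sent).view(-1, x.size(1), 1, 1)
        return gamma * x_hat + beta                     # scale and shift

# e.g. y = CAdaILN(256, 128)(torch.randn(2, 256, 16, 16), torch.randn(2, 128))
```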
Notice that our proposed dual-attention mechanisms and CAdaILN are all easy-to-implement methods, although they seem complicated.

Experiments
In this section, to prove the effectiveness of the proposed DiverGAN in diversity and in producing visually realistic and semantically consistent images, we perform a wealth of quantitative and qualitative evaluations on three benchmark data sets, i.e., Oxford-102 [14], CUB bird [13] and MS COCO [15]. To be specific, we clarify the details of the experimental settings in Section 5.1. After that, the proposed DiverGAN is compared to previous CGAN-based approaches for text-to-image generation in Section 5.2. Subsequently, we analyze the contributions of the different components of our DiverGAN in Section 5.3.
Oxford-102. The Oxford-102 data set includes 5,878 and 2,311 images for training and testing, respectively. Each image is accompanied by 10 textual descriptions.

Implementation details. For the text encoder, following AttnGAN [5], we utilize a pretrained bidirectional Long Short-Term Memory network [48] to acquire the word embeddings and the global sentence vector. We adopt the losses of DTGAN [10] owing to its superior results. We set the dimension of the latent code to 100. For training, we leverage the Adam optimizer [49] with β = (0.0, 0.9). We also follow the two time-scale update rule (TTUR) [50] and set the learning rates of the generator and the discriminator to 0.0001 and 0.0004, respectively. The batch size is set to 16. Our DiverGAN is implemented in PyTorch [51]. All experiments are performed on a single NVIDIA Tesla V100 GPU (32 GB memory).
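This optimizer setup translates directly into PyTorch; `generator` and `discriminator` below are placeholders with trainable parameters standing in for the DiverGAN networks:

```python
import torch
import torch.nn as nn

generator = nn.Linear(100, 4 * 4 * 256)      # stand-in for the DiverGAN generator
discriminator = nn.Linear(4 * 4 * 256, 1)    # stand-in for the discriminator

# Adam with beta = (0.0, 0.9) and the two time-scale update rule (TTUR):
# lr 0.0001 for the generator and 0.0004 for the discriminator.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
```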
Evaluation metrics. Our DiverGAN is evaluated by calculating three widely used metrics: the inception score (IS) [17], the Fréchet inception distance (FID) [54] and the learned perceptual image patch similarity (LPIPS) [55]. The first two measure visual quality; the last assesses the diversity of the generated samples.
IS. The IS is computed via the KL divergence between the conditional class distribution and the marginal class distribution. It is defined as

IS = exp(E_x[D_KL(p(y|x) || p(y))]),

where x is a generated sample and y is the corresponding label obtained from a pre-trained Inception v3 network [54]. The produced samples are split into multiple groups, the IS is calculated on each group of images, and the average and standard deviation of the score are reported. A higher IS indicates better quality of the generated images.
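A sketch of the computation from per-image class posteriors, assumed here to be an (N, 1000) tensor of softmax outputs from Inception v3:

```python
import torch

def inception_score(probs, splits=10):
    """probs: (N, num_classes) softmax outputs p(y|x) for N generated
    images. Returns the mean and standard deviation over `splits`
    groups of exp(E_x[KL(p(y|x) || p(y))])."""
    scores = []
    for chunk in probs.chunk(splits):
        marginal = chunk.mean(dim=0, keepdim=True)            # p(y) of the group
        kl = (chunk * (chunk.log() - marginal.log())).sum(1)  # KL per image
        scores.append(kl.mean().exp())
    scores = torch.stack(scores)
    return scores.mean().item(), scores.std().item()

# e.g. m, s = inception_score(torch.softmax(torch.randn(3000, 1000), dim=1))
```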
For the COCO data set, DTGAN [10], DF-GAN [1] and ObjGAN [56] argue that the IS fails to evaluate the synthetic samples and can be saturated, even over-fitted. Consequently, we do not compare the IS on the COCO data set.
FID. The FID computes the Fréchet distance between the distribution of the generated samples and the distribution of the true data. A lower FID means that the synthetic samples are closer to the corresponding real images. We use a pre-trained Inception v3 network to compute the FID. For the Oxford-102 data set, we do not list the FID for lack of scores to compare with. It should be noted that we synthesize 30,000 pictures from unseen textual descriptions for the IS and FID.
LPIPS. The LPIPS measures diversity by computing the average feature distance between synthetic images. The generated samples are considered diverse if the LPIPS score is large. The results of the LPIPS will be discussed in Section 5.3.4.
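In practice the score can be computed with the `lpips` package; the sketch below averages the pairwise distances within a batch of images generated from one description (the AlexNet backbone is an assumption, and inputs are expected in [-1, 1]):

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # backbone choice is an assumption

def mean_pairwise_lpips(images):
    """images: (N, 3, H, W) in [-1, 1], all generated from one caption.
    Returns the average pairwise LPIPS distance; larger = more diverse."""
    n = images.size(0)
    with torch.no_grad():
        dists = [loss_fn(images[i:i + 1], images[j:j + 1]).item()
                 for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)
```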

Comparison with state-of-the-art CGAN-based methods

Quantitative results
We compare our method with previous single-stage [1,10] and multi-stage [3-9,32] CGAN-based approaches on the CUB, Oxford-102 and COCO data sets. The IS of our DiverGAN and of the compared models on the CUB and Oxford-102 data sets is reported in Table 1. We can see that our DiverGAN achieves the best performance, significantly increasing the IS from 4.88 to 4.98 on the CUB data set and from 3.77 to 3.99 on the Oxford-102 data set. The experimental results indicate that our DiverGAN is capable of producing perceptually plausible pictures of higher quality than state-of-the-art approaches.
The comparison between our DiverGAN, StackGAN++ [4], AttnGAN [5] and DF-GAN [1] with respect to the FID on the CUB and COCO data sets is shown in Table 2. We can observe that our DiverGAN obtains a remarkably lower FID than the compared methods on both data sets, which demonstrates that our generated data distribution is closer to the true data distribution. More specifically, we reduce the FID from 28.92 to 20.52 on the challenging COCO data set and from 19.24 to 15.63 on the CUB data set.

Qualitative results
In addition to the quantitative comparison, we conduct qualitative experiments on the CUB, Oxford-102 and COCO data sets, illustrated in Fig. 6 and Fig. 7. Fig. 6 presents the qualitative comparison of DM-GAN [7], DF-GAN [1] and our DiverGAN on the COCO and CUB data sets, indicating that our DiverGAN has the capacity to synthesize high-quality and semantically consistent pictures conditioned on the text descriptions. For instance, in terms of complex scene generation, DiverGAN synthesizes a red bus with more vivid details than DF-GAN and DM-GAN in column 1 of Fig. 6(a). It can also be seen that DiverGAN produces a kitchen with a plausible wooden counter (2nd column), a clear red sign on the road (3rd column), fresh vegetables and fruits with rich color distributions (4th column), a man surfing on realistic sea waves (5th column) and a beautiful night street (6th column), whereas DM-GAN and DF-GAN both yield unclear objects (1st, 2nd, 3rd, 7th and 8th columns) and backgrounds with a single color distribution (4th, 5th and 6th columns). More importantly, DiverGAN creates an impressive European clock tower in column 7. The above results demonstrate that our DiverGAN, equipped with the dual-residual structure, is capable of capturing the crucial words in the sentence and highlighting the main objects of the image, generating a high-quality multi-object scene with vivid details.
As can be observed in Fig. 6(b), DM-GAN, DF-GAN and DiverGAN all yield promising birds with consistent colors and shapes, but our method concentrates better on the semantically related parts of the image, synthesizing perceptually realistic birds. In addition, some backgrounds synthesized by DM-GAN (1st, 2nd, 5th and 7th columns) and DF-GAN (5th, 7th and 8th columns) are not plausible. This indicates that, with word-level attention modules, our model is able to bridge the gap between visual feature maps and word-context vectors, producing high-quality samples which semantically align with the text descriptions. The qualitative comparison of DF-GAN, DTGAN [10] and DiverGAN on the Oxford-102 data set is depicted in Fig. 7. We can observe that our approach synthesizes visually plausible flowers with more vivid details, richer color distributions and clearer shapes than DF-GAN and DTGAN, which confirms the effectiveness of DiverGAN.

Human evaluation
We perform a human test on the CUB and COCO data sets, so as to evaluate the image quality and the semantic consistency of DM-GAN [7], DF-GAN [1] and our DiverGAN. We randomly select 100 images from both data sets, respectively. Given the same text descriptions, users are asked to choose the best sample synthesized by three approaches according to the image details and the corresponding natural-language description. In addition, the final scores are computed by two judges for fairness. As illustrated in Fig. 8, our method impressively outperforms DM-GAN and DF-GAN on both data sets, especially on the challenging COCO data set, which demonstrates the superiority of our proposed DiverGAN.

Ablation studies of the proposed approach
In order to evaluate the contributions of the different components of our DiverGAN, we conduct extensive ablation studies on the CUB, Oxford-102 and COCO data sets. The novel components of our model include a channel-attention module (CAM), a pixel-attention module (PAM), CAdaILN, a dual-residual block and the insertion of a fully-connected layer (FC) between the first and the second residual block. We first quantitatively explore the effectiveness of these components. Note that we do not delete the dual-residual block in our ablation studies, since it is the basic structure of DiverGAN. All results are reported in Table 3.
By comparing M1 (DiverGAN) with M2 (removing the FC), the introduction of the FC remarkably enhances the IS from 4.91 to 4.98 on the CUB data set and from 3.87 to 3.99 on the Oxford-102 data set, and reduces the FID from 16.42 to 15.63 on the CUB data set and from 22.53 to 20.52 on the challenging COCO data set. The experimental results demonstrate the importance of adopting a linear layer in DiverGAN. By exploiting CAdaILN in our DiverGAN, M1 performs better than M3 (removing CAdaILN), confirming the effectiveness of the proposed CAdaILN. To be specific, the IS is improved by 0.62 on the CUB data set and 0.53 on the Oxford-102 data set, and the FID is decreased by 8.54 on the CUB data set and 6.28 on the COCO data set. Furthermore, M1 achieves better results than both M4 (removing CAM) and M5 (removing PAM), which indicates that these two new types of word-level attention modules can help the generator yield more realistic images.

Effectiveness of the attention modules
To prove the effectiveness of our CAM and PAM, we also explore the performance of DiverGAN with the attention modules of other text-to-image generation methods on the CUB and Oxford-102 data sets. Concretely, we replace the CAM and PAM in each residual block of M2 (DiverGAN removing the FC) with the attention modules of AttnGAN [5] and ControlGAN [6], respectively. Table 4 displays the comparison results. It can be observed that our proposed attention mechanisms outperform those of AttnGAN and ControlGAN, improving the IS by 0.28 on the CUB data set and 0.14 on the Oxford-102 data set, and reducing the FID from 20.47 to 16.42 on the CUB data set, which verifies the superiority of our proposed CAM and PAM.
The reason behind this result may be that the prior attention modules directly convert semantic vectors into visual feature maps, adopting the weighted sum of converted word features as the new feature map, which differs largely from the original feature map. In contrast, our attention modules strengthen the visual feature map according to the contextual-semantic relevance while preserving the basic features to some extent. Additionally, CAM and PAM emphasize the salient words in the given sentence instead of treating all words equally. Notably, with word-level attention modules, our DiverGAN also has the ability to manipulate parts of the generated samples, which we detail in Section 5.3.4.

Effectiveness of the proposed CAdaILN
To further verify the benefits of CAdaILN, we conduct an ablation study on normalization functions. We first design a baseline model by removing CAdaILN from DiverGAN (M3). Then, we compare variants of the normalization layers. Note that BN conditioned on the global sentence vector (BN-sent) and BN conditioned on the word vectors (BN-word) are based on the semantic-conditioned Batch Normalization of SDGAN [8], and the CAdaILN function with the word vectors (CAdaILN-word) follows the word-level normalization method of SDGAN. The results of the ablation study are shown in Table 5. Comparing B2 with B4 and B3 with B5, we observe that CAdaILN significantly outperforms BN, whether using the sentence-level or the word-level linguistic cues. Moreover, comparing B5 with B4, CAdaILN with the global sentence vector performs better than CAdaILN-word, improving the IS from 4.71 to 4.91 on the CUB data set and from 3.73 to 3.87 on the Oxford-102 data set, and reducing the FID from 17.19 to 16.42 on the CUB data set and from 23.79 to 22.53 on the COCO data set, since sentence-level features may be easier to train in our generator network than word-level features. The above analysis demonstrates the effectiveness of our designed CAdaILN.

Effect of the number of residual blocks in the dual-residual structure
To evaluate the impact of the number of residual blocks in the dual-residual structure, we compare the performance of one residual block (single-block) and two residual blocks (dual-block) on the CUB and Oxford-102 data sets, as shown in Table 6. Compared with the single-block variant, the proposed dual-residual block enhances the IS by 0.25 on the CUB data set and 0.12 on the Oxford-102 data set, and decreases the FID by 2.16 on the CUB data set, which shows the effectiveness of the dual-residual structure.

Effectiveness of the fully-connected layer in image diversity
To validate the effectiveness of the dense layer in diversity, we perform quantitative evaluation on the CUB, Oxford-102 and COCO data sets, and qualitative comparison on the CUB data set.
Quantitative results. The quantitative comparison is divided into three groups according to their goals. The first group evaluates the quality and diversity of the samples synthesized by the single-stage methods (i.e., DF-GAN and DTGAN) and by StackGAN++ and MSGAN, which propose conditioning augmentation (CA) and a mode-seeking regularization term (MS) to enhance image diversity, respectively. The second group compares our proposed fully-connected (FC) method with the CA and MS. For fair comparisons, we take the model M2 (DiverGAN removing the FC) in Table 3 as the baseline model (F5). F6 (F5 + CA) and F7 (F5 + MS) add the CA and MS to F5, respectively. In the last group (i.e., F8 (F5 + FC(1)), F9 (F5 + FC(1+1)) and F10 (DiverGAN)), we verify the effectiveness of the different ways of inserting the linear layer into F5. F8 and F9 indicate the models that plug one and two dense layers, respectively, after the first residual block of F5.
The quantitative results are reported in Table 7. Comparing F10 (DiverGAN) with F1 (StackGAN++), F2 (MSGAN), F3 (DF-GAN) and F4 (DTGAN), our DiverGAN improves the LPIPS from 0.544 to 0.682 on the CUB data set, confirming the superiority of our DiverGAN in diversity. Compared with F5, F7 enhances the LPIPS by 0.118 on the CUB data set, whereas it decreases the IS by 0.34 on the CUB data set and 0.05 on the Oxford-102 data set, and increases the FID by 2.06 on the CUB data set and 1.91 on the COCO data set.

Table 3. Ablation study of our DiverGAN. CAM, PAM and FC represent the channel-attention module, the pixel-attention module and the insertion of the fully-connected layer between the first and the second residual block, respectively. The best results are in bold.

Qualitative results. The qualitative comparison of the models, including F8 (F5 + FC(1)), F9 (F5 + FC(1+1)) and F10 (DiverGAN), is presented in Fig. 9. DF-GAN + FC(2) and DTGAN + FC(2) refer to DF-GAN and DTGAN with a linear layer plugged in after the second residual block of the architecture, respectively.

We can observe that, although DF-GAN (1st row) and DTGAN (2nd row) both perform well in quality, the shapes of the synthetic birds look similar and the background colors are the same. After inserting a dense layer into the framework, however, DF-GAN (3rd row) and DTGAN (4th row) both yield more diverse birds (e.g., different shapes, orientations and even background colors) than the original frameworks, demonstrating the generalizability of our proposed method. The reason behind the similar backgrounds of DF-GAN + FC(2) may be that the background colors in DF-GAN depend only on the textual embeddings, due to the introduction of the modulation module. By comparing our DiverGAN (9th row) with F6 (F5 + CA) and F7 (F5 + MS), we can see that a linear layer contributes to producing diverse birds with vivid details, whereas CA does not improve the diversity of birds and MS may affect the image quality. For example, in row 9, the birds stand on a branch or on the ground, and the orientations of the birds, the background colors and the visual appearances of the footholds differ. Nonetheless, the birds in row 5 still have similar shapes and the same background color, while the birds in row 6 look a little blurry. Compared with F5, we can observe that F8 (F5 + FC(1)), F9 (F5 + FC(1+1)) and DiverGAN all generate diverse samples, further confirming the effectiveness of the FC in diversity. In addition, DiverGAN generates realistic birds with higher quality than F6 and F7, which validates the superiority of DiverGAN.

Table 6. Effect of the number of residual blocks in the dual-residual structure. Single-block and dual-block indicate the residual structure with a single block and two residual blocks, respectively. The best results are in bold.

Table 7. Effectiveness of the fully-connected layer (FC) in image diversity. Baseline (F5) corresponds to model M2 (DiverGAN removing the FC) in Table 3. CA and MS indicate conditioning augmentation and a mode-seeking regularization term, respectively. FC(1) and FC(1+1) represent the insertion of one and two fully-connected layers after the first residual block, respectively. The best results are in bold.

To validate the sensitivity of our DiverGAN, we generate birds by modifying just one word in the given text description. As can be seen in Fig. 10, when we change the color attribute in the natural-language description, the proposed DiverGAN produces semantically consistent birds according to the modified text while retaining the visual appearances (e.g., shape, position and texture) correlated with the unmodified parts. Additionally, our method synthesizes a suite of birds with different visual appearances of footholds, background colors, orientations and shapes by changing the latent codes. Therefore, with the word-level attention modules and the FC method, DiverGAN is able to effectively disentangle the attributes of the input text description while accurately controlling regions of the sample without hurting diversity.

Fig. 9. Qualitative results of the compared approaches and our DiverGAN (bottom row) when given a single text description, on the CUB data set. DF-GAN + FC(2) and DTGAN + FC(2) refer to DF-GAN and DTGAN with a dense layer plugged in after the second residual block, respectively. F5, F6, F7, F8 and F9 represent the corresponding models in Table 7.

Interpolation of latent space in DiverGAN
To better understand how DiverGAN utilizes latent codes to achieve diversity, we conduct linear interpolation between two random latent codes and produce the corresponding pictures. The interpolation results of DiverGAN are presented in Fig. 11. We can see that the background and the visual appearances of the footholds (1st and 4th rows), the shapes and the positions of the birds (2nd row), and the orientations and shapes of the birds together with the visual appearances of the footholds (3rd row) gradually change with the variation of the latent codes. Furthermore, we discover that DiverGAN is likely to yield a series of high-quality images when the samples conditioned on the first and the last latent codes are visually realistic. At the same time, the interpolation results often look blurry if DiverGAN is not able to generate plausible samples for these two latent codes. Therefore, we assume that DiverGAN may synthesize high-quality pictures based on the latent codes around a 'good' latent code that enables DiverGAN to yield a realistic sample. Our next challenge is to find such 'good' latent codes in order to facilitate the synthesis of large numbers of satisfactory results, e.g., for the purpose of data augmentation.
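The interpolation itself is a straight line in latent space; a sketch follows, with `generator` and its conditioning inputs as placeholders for the DiverGAN model:

```python
import torch

def interpolate_latents(generator, sent_emb, word_embs, steps=8):
    """Generate images along the line between two random latent codes.
    `generator(z, sent_emb, word_embs)` is a hypothetical signature."""
    z0, z1 = torch.randn(1, 100), torch.randn(1, 100)   # latent dim 100, as in the paper
    images = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * z0 + t * z1                       # linear interpolation
        images.append(generator(z, sent_emb, word_embs))
    return images
```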

Conclusion
In this paper, we propose a unified, effective single-stage framework called DiverGAN for yielding diverse and perceptually realistic samples that are semantically related to the given textual descriptions. DiverGAN exploits two new types of word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM), to make the generator concentrate more on the useful channels and pixels that semantically match the salient words of the natural-language description. In addition, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is designed to use the linguistic cues to flexibly control the amount of change in shape and texture, strengthening the visual-semantic representation and complementing the modulation modules. Furthermore, a dual-residual block is employed to accelerate convergence while enhancing image quality. Meanwhile, a fully-connected layer is introduced into our architecture to combat the lack-of-diversity problem by enhancing the generative ability of the network. Extensive experiments on three benchmark data sets show that our DiverGAN achieves remarkably better performance than existing methods in quality and diversity. Furthermore, our presented components (i.e., CAM, PAM and CAdaILN) are general methods and can be readily integrated into current text-to-image architectures to reinforce feature maps with textual-context vectors. More importantly, our proposed pipeline tackles the lack-of-diversity issue of current single-stage methods and can serve as a strong basis for developing better single-stage models.

For future work, we will investigate how to produce plausible samples that are semantically correlated with text descriptions in an unsupervised way, and how to yield a suite of high-quality pictures based on regions of good solutions in the latent space. One pervasive problem in the evaluation of image-generation methods is the degree of subjectivity involved. In human evaluations, more effort should be spent on guiding the human attention to specific aspects of the image: "Is the content semantically consistent with the text probe?" but also: "Is the background pattern believable/natural?". Therefore, in future research, we will extend the questionnaire for the human subjects to ask more precisely what they think about the backgrounds of the generated samples.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.