

A Chinese Text to High-Resolution Image Synthesis System Based on Attentional Generative Adversarial Networks

Advisor: 李維聰

Abstract


In recent years, with the development of artificial intelligence, many image-generation applications have emerged, such as image synthesis, image style transfer, and high-resolution image generation; however, applications that generate images from Chinese sentences remain quite rare. Moreover, conventional generative adversarial networks tend to produce blurry details when generating images. Adding an attention mechanism to the generative adversarial network allows each word in a sentence to be matched to a corresponding detail of the image, such as color or body structure, so that the model generates images that are not merely roughly consistent with the sentence but truly sharp and faithful to it.

This thesis therefore combines an attention mechanism with a generative adversarial network to convert sentences described in Chinese into high-resolution images. The proposed attentional generative adversarial network uses Google's InceptionResNetV2 as the image-encoder architecture and a bidirectional GRU as the text encoder for training. The model contains three generators and three discriminators in total; the three generators produce images of 64*64*3, 128*128*3, and 256*256*3 pixels respectively, a design that builds high-resolution images from low-resolution ones. The three discriminators score the three generators separately, each evaluation covering two parts: whether the image is realistic and whether it matches the sentence description. The final output is an image with high resolution and fine detail.

This thesis evaluates the results with the Inception Score, which measures whether a model generates diverse and realistic images. The experimental results show that the proposed model, GraGAN, achieves an Inception Score of 4.35 on English sentences, compared with 4.33 for AttnGAN and 4.08 for StackGAN-v2; on Chinese sentences, GraGAN achieves 4.31 versus 4.26 for AttnGAN. After adjusting the text encoder and image encoder, the Inception Scores of this thesis are consistently higher than those of AttnGAN. The slightly lower score on Chinese sentences arises because, unlike English, not every Chinese character carries independent meaning on its own; for example, 螳螂 (mantis) is meaningful only when the characters 螳 and 螂 appear together, whereas each character alone is meaningless. English does not have this problem, so a score slightly below that of the English-sentence model is acceptable. StackGAN-v2 also generates high-resolution images, but its details are inferior to the images produced in this thesis, so its score is clearly lower.

The contribution of this thesis is converting the dataset into Chinese sentences and, through Chinese-sentence preprocessing together with bidirectional-GRU pre-training and InceptionResNetV2 encoding, reducing training time and computation. The experimental results demonstrate that a Chinese-sentence dataset can likewise achieve good results within this family of generative adversarial networks.
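The word-to-detail matching described above is the core of the attention step: each image sub-region attends over the words of the sentence and receives a word-context vector. A minimal NumPy sketch of that step (an illustration with random placeholder features, not the thesis code; the shapes 8 words by 64 sub-regions are assumptions):

```python
import numpy as np

def word_region_attention(words, regions):
    """Attend each image sub-region to the words of the sentence.

    words:   (T, D) word features from the text encoder
    regions: (N, D) sub-region features from the image decoder
    Returns (N, D) word-context vectors, one per sub-region.
    """
    scores = regions @ words.T                   # (N, T) region-word similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over the words
    return attn @ words                          # weighted sum of word features

rng = np.random.default_rng(0)
words = rng.standard_normal((8, 256))     # 8 words, 256-d features
regions = rng.standard_normal((64, 256))  # 8x8 = 64 image sub-regions
context = word_region_attention(words, regions)
print(context.shape)  # (64, 256)
```

Each context vector tells the next generator stage which words are most relevant to that region, which is what lets fine details such as color follow individual words.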

Abstract (English)


In the era of artificial intelligence, many applications of image generation have been proposed, e.g., image synthesis, image style transfer, and high-quality image generation. However, applications that generate images from Chinese sentences are quite rare. In addition, conventional generative adversarial networks often produce blurry details when generating images. Adding an attention mechanism to the generative adversarial network allows every word in the sentence to be matched to the corresponding detail in the image, such as color or body structure, so that the system generates not merely an image that roughly conforms to the sentence but one that is truly sharp and faithful to it. Therefore, this thesis combines the attention mechanism with a generative adversarial network to convert Chinese sentences into high-resolution images. A model called GraGAN is proposed. GraGAN uses Google's InceptionResNetV2 as the image-encoder architecture and a bidirectional GRU as the text encoder. The model has three generators and three discriminators in total. The three generators produce images of 64*64*3, 128*128*3, and 256*256*3 pixels respectively, a design that builds high-quality images from low-quality ones. The three discriminators evaluate the three generators separately; each evaluation has two parts, judging whether the image is realistic and whether it matches the sentence description. The final output is an image with high resolution and fine detail. This thesis uses the Inception Score to evaluate the quality of image generation; it measures the diversity and realism of the generated images. The experimental results show that GraGAN achieves an Inception Score of 4.35 when trained with English sentences, compared with 4.33 for AttnGAN and 4.08 for StackGAN-v2.
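The Inception Score reported above is computed from the class posteriors p(y|x) that a pretrained Inception classifier assigns to each generated image: IS = exp(E_x[KL(p(y|x) || p(y))]). A minimal NumPy sketch, under the assumption that the classifier posteriors are already available as a matrix:

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: (N, C) softmax outputs of a pretrained classifier for N images.

    IS = exp( E_x[ KL(p(y|x) || p(y)) ] ).  The score is high when each
    prediction is confident (realism) and the marginal p(y) is spread
    over many classes (diversity)."""
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Toy check: confident, evenly spread predictions score higher than
# maximally unsure ones (whose score is exactly 1).
peaked = np.eye(10)[np.arange(100) % 10] * 0.99 + 0.001  # near one-hot rows
uniform = np.full((100, 10), 0.1)                        # uniform rows
print(inception_score(peaked) > inception_score(uniform))  # True
```

In practice the posteriors come from an Inception network pretrained on ImageNet, and the score is averaged over several random splits of the generated images.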
When trained with Chinese sentences, GraGAN achieves an Inception Score of 4.31 versus 4.26 for AttnGAN. After adjusting the text encoder and image encoder, the Inception Scores of the proposed GraGAN are consistently higher than those of AttnGAN. The score for Chinese-sentence training is slightly lower than for English-sentence training because a Chinese word is very different from an English word: each English word has its own meaning and can be recognized as a token, whereas in Chinese two or three characters may be needed to compose a meaningful term. Moreover, StackGAN-v2 also generates high-resolution images but lacks fine detail, so its score is significantly lower than that of the proposed model. The contribution of this thesis is to convert the dataset into Chinese sentences and, through Chinese-sentence preprocessing together with bidirectional-GRU pre-training and InceptionResNetV2 encoding, to reduce the training time and computational complexity of the proposed GraGAN. The experimental results show that a Chinese dataset can achieve results as good as an English dataset within generative adversarial networks.
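The tokenization difference discussed above can be seen directly: English sentences split into meaningful words on whitespace, while Chinese is written without spaces, and splitting it into single characters breaks multi-character terms such as 螳螂 apart. A small pure-Python illustration (the example sentences are made up; a real pipeline would use a Chinese word segmenter in the preprocessing step):

```python
def naive_tokens(sentence):
    """Whitespace tokenization: adequate for English, useless for Chinese,
    which has no spaces between words."""
    return sentence.split()

english = "a green mantis stands on a leaf"
chinese = "一隻綠色的螳螂站在葉子上"

print(naive_tokens(english))  # 7 meaningful word tokens
print(naive_tokens(chinese))  # 1 token: the whole unsegmented sentence
print(list(chinese))          # per-character split: 螳螂 becomes 螳 and 螂,
                              # neither of which is meaningful alone
```

This is why word-level attention has a harder time with naively tokenized Chinese text, and why the Chinese-sentence scores are slightly lower than the English ones.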

