Abstract
Image captioning is the task of generating a description of an image, which requires recognizing the important attributes in the image as well as the relationships among them. The task demands sentences that are both semantically and syntactically correct. Most image captioning models are based on RNNs trained with maximum likelihood estimation (MLE). We propose a novel model based on a generative adversarial network (GAN) that generates the caption of an image from its learned representation and does not require a secondary learning algorithm such as policy gradient. Because benchmark datasets such as Flickr and COCO are demanding in both volume and complexity, we introduce a new dataset and perform our experiments on it. The experimental results show the effectiveness of our model compared with state-of-the-art image captioning methods.
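To make the abstract's central claim concrete, the sketch below shows one way a caption GAN can be trained without a policy-gradient step: a Gumbel-softmax relaxation keeps token sampling differentiable, so the discriminator's gradient flows directly into the generator. This is a minimal illustration under assumed dimensions and a hypothetical architecture, not the authors' model; all module names and hyperparameters here are illustrative.

```python
# Hypothetical sketch: GAN caption generator trained end-to-end (no policy gradient).
# Assumption: Gumbel-softmax is used as the differentiable sampling relaxation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, IMG_FEAT, HID, MAX_LEN = 1000, 64, 256, 128, 12

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.img2h = nn.Linear(IMG_FEAT, HID)   # image feature -> initial hidden state
        self.embed = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRUCell(EMB, HID)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, img_feat, tau=1.0):
        b = img_feat.size(0)
        h = torch.tanh(self.img2h(img_feat))
        tok = self.embed(torch.zeros(b, dtype=torch.long))    # <BOS> = index 0
        soft_tokens = []
        for _ in range(MAX_LEN):
            h = self.gru(tok, h)
            logits = self.out(h)
            y = F.gumbel_softmax(logits, tau=tau, hard=False)  # differentiable sample
            soft_tokens.append(y)
            tok = y @ self.embed.weight                        # soft embedding lookup
        return torch.stack(soft_tokens, dim=1)                 # (b, MAX_LEN, VOCAB)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(VOCAB + IMG_FEAT, HID, batch_first=True)
        self.score = nn.Linear(HID, 1)

    def forward(self, soft_caption, img_feat):
        img = img_feat.unsqueeze(1).expand(-1, soft_caption.size(1), -1)
        _, h = self.gru(torch.cat([soft_caption, img], dim=-1))
        return self.score(h[-1])    # real/fake logit per (image, caption) pair

G, D = Generator(), Discriminator()
img = torch.randn(4, IMG_FEAT)      # stand-in for CNN image features
fake = G(img)
g_loss = F.binary_cross_entropy_with_logits(D(fake, img), torch.ones(4, 1))
g_loss.backward()                   # gradients reach G directly: no policy gradient
```

Because the generator emits soft token distributions rather than discrete samples, the discriminator's loss backpropagates through the whole decoding loop, which is the property that removes the need for REINFORCE-style training used by discrete-token caption GANs.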
Ethics declarations
Human participants or animals
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Dehaqi, A.M., Seydi, V. & Madadi, Y. Adversarial Image Caption Generator Network. SN COMPUT. SCI. 2, 182 (2021). https://doi.org/10.1007/s42979-021-00486-y