
Adversarial Image Caption Generator Network

Original Research · SN Computer Science

Abstract

Image captioning is the task of producing a description of an image, which requires recognizing the important attributes in the image as well as the relationships between them, and then generating semantically and syntactically correct sentences. Most image captioning models are based on RNNs trained with maximum likelihood estimation (MLE). We propose a novel model based on a generative adversarial network (GAN) that generates the caption of the image from a representation of that image and does not need any secondary learning algorithm such as policy gradient. Because benchmark datasets such as Flickr and COCO are demanding in both volume and complexity, we introduce a new dataset and perform our experiments on it. The experimental results show the effectiveness of our model compared to state-of-the-art image captioning methods.
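
The abstract leaves the training procedure at a high level. One standard way to make a text GAN trainable by backpropagation alone, with no policy-gradient step, is to have the generator emit continuous word distributions rather than sampled discrete tokens, so the discriminator's gradient flows straight back into the generator. The sketch below illustrates that idea; the layer choices, dimensions, and module names are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of adversarial caption generation without policy gradients.
# All names, dimensions, and the GRU-based design are assumptions for
# illustration; the paper's exact architecture may differ.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a CNN image feature vector to a sequence of soft word distributions."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512,
                 vocab_size=10000, max_len=20):
        super().__init__()
        self.max_len = max_len
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Linear(vocab_size, embed_dim)  # soft embedding lookup
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat):
        h = torch.tanh(self.init_h(img_feat))
        word = torch.zeros(img_feat.size(0), self.out.out_features,
                           device=img_feat.device)  # start-token placeholder
        steps = []
        for _ in range(self.max_len):
            h = self.rnn(self.embed(word), h)
            # Softmax keeps the output continuous, so the discriminator's
            # gradient backpropagates directly -- no REINFORCE-style estimator.
            word = torch.softmax(self.out(h), dim=-1)
            steps.append(word)
        return torch.stack(steps, dim=1)  # (batch, max_len, vocab_size)

class Discriminator(nn.Module):
    """Scores how well a (soft) caption matches the image feature."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512,
                 vocab_size=10000):
        super().__init__()
        self.embed = nn.Linear(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, img_feat, caption_dists):
        _, h = self.rnn(self.embed(caption_dists))
        return self.score(torch.cat([h[-1], img_feat], dim=-1))  # real/fake logit
```

A training step would then feed real captions, encoded as one-hot tensors of shape (batch, max_len, vocab_size), and generated distributions to the discriminator under a standard GAN loss; because every operation above is differentiable, updating the generator needs no secondary learning algorithm.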





Author information

Correspondence to Vahid Seydi.

Ethics declarations

Human participants or animals

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Dehaqi, A.M., Seydi, V. & Madadi, Y. Adversarial Image Caption Generator Network. SN COMPUT. SCI. 2, 182 (2021). https://doi.org/10.1007/s42979-021-00486-y
