Abstract
Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, many visual attention models fail to consider the correlation between the image and the textual context, which may lead to attention vectors that contain irrelevant annotation vectors. To overcome this limitation, we propose a new text-based visual attention (TBVA) model that automatically focuses on salient objects by eliminating irrelevant information given the previously generated text. The proposed end-to-end caption generation model adopts the architecture of a multimodal recurrent neural network. We leverage the transposed weight sharing scheme to achieve better performance while reducing the number of parameters. The effectiveness of our model is validated on MS COCO and Flickr30k. The results show that TBVA outperforms state-of-the-art image captioning methods.
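The core idea of text-conditioned visual attention can be illustrated with a minimal additive-attention sketch: region annotation vectors are scored against the previously generated text context, so regions irrelevant to the text receive low weight. The dimensions, weight names, and initialization below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper):
# L image regions, D-dim visual features, H-dim text state, K-dim attention space.
L, D, H, K = 196, 512, 512, 256

a = rng.standard_normal((L, D))    # annotation vectors (one per image region)
h_text = rng.standard_normal(H)    # context from previously generated words

# Hypothetical projection weights of the attention module
W_a = rng.standard_normal((K, D)) * 0.01
W_h = rng.standard_normal((K, H)) * 0.01
w = rng.standard_normal(K) * 0.01

def text_based_attention(a, h_text):
    """Score each region jointly with the text context, then take a
    softmax-weighted sum of the annotation vectors."""
    scores = np.tanh(a @ W_a.T + W_h @ h_text) @ w   # (L,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # softmax over regions
    z = alpha @ a                                    # attended visual vector (D,)
    return z, alpha

z, alpha = text_based_attention(a, h_text)
print(z.shape, alpha.shape)
```

In a full captioning model, `z` would be fed into the decoder at each step, so the attended visual evidence changes as the generated text evolves.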
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants No. 61673402, 61273270 and 60802069, by the Natural Science Foundation of Guangdong Province (2017A030311029 and 2016B010109002), by the Science and Technology Program of Guangzhou, China, under Grants No. 201704020180 and 201604020024, and by the Fundamental Research Funds for the Central Universities of China.
Cite this article
He, C., Hu, H. Image Captioning with Text-Based Visual Attention. Neural Process Lett 49, 177–185 (2019). https://doi.org/10.1007/s11063-018-9807-7