Image Captioning with Text-Based Visual Attention

Abstract

Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, many visual attention models do not consider the correlation between the image and the textual context, which may lead to attention vectors that contain irrelevant annotation vectors. To overcome this limitation, we propose a new text-based visual attention (TBVA) model that automatically focuses on salient objects by eliminating irrelevant information, conditioned on the previously generated text. The proposed end-to-end caption generation model adopts the architecture of a multimodal recurrent neural network. We leverage the transposed weight sharing scheme to achieve better performance while reducing the number of parameters. The effectiveness of our model is validated on MS COCO and Flickr30k. The results show that TBVA outperforms state-of-the-art image captioning methods.
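As a rough illustration for readers without full-text access, the sketch below shows one common way to condition visual attention on previously generated text, in the spirit of the abstract: each annotation vector from the CNN is scored against a textual hidden state, so image regions irrelevant to the text receive near-zero weight. This is a minimal sketch under assumed names, dimensions, and an additive scoring form (text_based_attention, W_a, W_h, v are all hypothetical), not the authors' exact formulation.

    import numpy as np

    def softmax(x):
        # Numerically stable softmax over the last axis.
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def text_based_attention(annotations, text_state, W_a, W_h, v):
        # annotations: (L, D) CNN annotation vectors for L image regions.
        # text_state:  (H,)   hidden state summarizing previously generated words.
        # Additive (Bahdanau-style) scores conditioned on the textual context;
        # regions unrelated to the text receive small weights after the softmax.
        scores = np.tanh(annotations @ W_a + text_state @ W_h) @ v  # (L,)
        alpha = softmax(scores)            # attention weights, sum to 1
        context = alpha @ annotations      # (D,) attended visual vector
        return context, alpha

    # Toy smoke test with hypothetical dimensions.
    L, D, H, K = 49, 512, 256, 128
    rng = np.random.default_rng(0)
    context, alpha = text_based_attention(
        rng.standard_normal((L, D)), rng.standard_normal(H),
        rng.standard_normal((D, K)), rng.standard_normal((H, K)),
        rng.standard_normal(K))
    assert np.isclose(alpha.sum(), 1.0)

The transposed weight sharing mentioned above most likely refers to tying the decoder's output projection to the transpose of the word-embedding matrix, as in Mao et al.'s m-RNN line of work, so that only one vocabulary-sized matrix is learned; this is what reduces the parameter count.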

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants No. 61673402, 61273270, and 60802069, by the Natural Science Foundation of Guangdong Province under Grants 2017A030311029 and 2016B010109002, by the Science and Technology Program of Guangzhou, China, under Grants No. 201704020180 and 201604020024, and by the Fundamental Research Funds for the Central Universities of China.

Author information

Corresponding author

Correspondence to Haifeng Hu.

About this article

Cite this article

He, C., Hu, H. Image Captioning with Text-Based Visual Attention. Neural Process Lett 49, 177–185 (2019). https://doi.org/10.1007/s11063-018-9807-7
