Abstract
Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, many visual attention models fail to consider the correlation between the image and the textual context, which may lead to attention vectors that contain irrelevant annotation vectors. To overcome this limitation, we propose a new text-based visual attention (TBVA) model that automatically focuses on salient objects by eliminating irrelevant information given the previously generated text. The proposed end-to-end caption generation model adopts the architecture of a multimodal recurrent neural network. We leverage the transposed weight sharing scheme to achieve better performance while reducing the number of parameters. The effectiveness of our model is validated on MS COCO and Flickr30k. The results show that TBVA outperforms state-of-the-art image captioning methods.
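The core idea of text-conditioned visual attention can be illustrated with a minimal additive-attention sketch: region annotation vectors are scored against the previously generated text context, so regions irrelevant to the text receive low weight. The dimensions, weight names, and initialization below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper):
# L image regions, D-dim visual features, H-dim text state, K-dim attention space.
L, D, H, K = 196, 512, 512, 256

a = rng.standard_normal((L, D))    # annotation vectors (one per image region)
h_text = rng.standard_normal(H)    # context from previously generated words

# Hypothetical projection weights of the attention module
W_a = rng.standard_normal((K, D)) * 0.01
W_h = rng.standard_normal((K, H)) * 0.01
w = rng.standard_normal(K) * 0.01

def text_based_attention(a, h_text):
    """Score each region jointly with the text context, then take a
    softmax-weighted sum of the annotation vectors."""
    scores = np.tanh(a @ W_a.T + W_h @ h_text) @ w   # (L,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # softmax over regions
    z = alpha @ a                                    # attended visual vector (D,)
    return z, alpha

z, alpha = text_based_attention(a, h_text)
print(z.shape, alpha.shape)
```

In a full captioning model, `z` would be fed into the decoder at each step, so the attended visual evidence changes as the generated text evolves.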
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants No. 61673402, 61273270 and 60802069, by the Natural Science Foundation of Guangdong Province (2017A030311029 and 2016B010109002), by the Science and Technology Program of Guangzhou, China, under Grants No. 201704020180 and 201604020024, and by the Fundamental Research Funds for the Central Universities of China.
Cite this article
He, C., Hu, H. Image Captioning with Text-Based Visual Attention. Neural Process Lett 49, 177–185 (2019). https://doi.org/10.1007/s11063-018-9807-7