
c-RNN: A Fine-Grained Language Model for Image Captioning


Abstract

Predecessor captioning methods based on the conventional deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architecture follow the machine translation paradigm with word-level modelling. Word-level modelling, however, requires an optimal word segmentation algorithm to split each sentence into words, which is a very difficult task. In this paper, we build a character-level RNN (c-RNN) that models captions directly at the character level, composing a descriptive sentence as a flow of characters. The c-RNN performs the language task at a finer granularity and naturally avoids the word segmentation issue. It empowers the language model to dynamically reason about word spelling as well as grammatical rules, resulting in expressive and elaborate sentences. We optimize the parameters of the neural networks by maximizing the probability of correctly generated character sequences. Quantitative and qualitative experiments on the popular MSCOCO and Flickr30k datasets show that the c-RNN describes images at considerably faster speed and with satisfactory quality.
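To make the decoding scheme concrete, the following is a minimal sketch of a character-level caption decoder in PyTorch. It is a hypothetical illustration under stated assumptions, not the authors' released code: the class name, layer sizes, the 70-symbol character vocabulary, and the single-layer LSTM are all assumptions. A CNN image feature conditions the first step of an LSTM, which then emits the caption one character per step; training maximizes the log-probability of the ground-truth character sequence, i.e. minimizes character-level cross-entropy.

import torch
import torch.nn as nn

class CharCaptionRNN(nn.Module):
    """Hypothetical character-level caption decoder (not the paper's exact model)."""
    def __init__(self, vocab_size=70, embed_dim=128, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # CNN feature becomes the step-0 input
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one embedding per character
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # logits over the character set

    def forward(self, img_feat, char_ids):
        # img_feat: (B, feat_dim); char_ids: (B, T) indices of caption characters
        img_tok = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, E)
        chars = self.embed(char_ids)                      # (B, T, E)
        x = torch.cat([img_tok, chars], dim=1)            # image conditions the whole sequence
        h, _ = self.lstm(x)
        return self.out(h)                                # (B, T+1, vocab_size)

# Training: maximize the probability of the correct character sequence,
# i.e. minimize cross-entropy between each step's prediction and the next character.
model = CharCaptionRNN()
feats = torch.randn(4, 2048)                  # dummy CNN image features
caps = torch.randint(0, 70, (4, 30))          # dummy character-encoded captions
logits = model(feats, caps[:, :-1])           # teacher forcing: feed all but the last character
loss = nn.functional.cross_entropy(
    logits[:, 1:].reshape(-1, 70),            # prediction from char t-1 is aligned with char t;
    caps[:, 1:].reshape(-1))                  # the image step's prediction of char 0 is dropped
loss.backward()

Because the output vocabulary is a few dozen symbols rather than tens of thousands of words, the softmax at each step is small, which is consistent with the abstract's claim of faster generation, at the cost of longer sequences per sentence.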







Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61673402, 61273270, 60802069), the Natural Science Foundation of Guangdong Province (2017A030311029, 2016B010123005, 2017B090909005), the Science and Technology Program of Guangzhou of China (201704020180, 201604020024), and the Fundamental Research Funds for the Central Universities of China.

Author information


Corresponding author

Correspondence to Haifeng Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Huang, G., Hu, H. c-RNN: A Fine-Grained Language Model for Image Captioning. Neural Process Lett 49, 683–691 (2019). https://doi.org/10.1007/s11063-018-9836-2


