
c-RNN: A Fine-Grained Language Model for Image Captioning


Abstract

Predecessor captioning methods based on the conventional deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architecture follow the machine translation paradigm with word-level modelling. Word-level modelling, however, requires an optimal word segmentation algorithm to split each sentence into words, which is a very difficult task. In this paper, we build a character-level RNN (c-RNN) that models captions directly at the character level, composing a descriptive sentence as a flow of characters. The c-RNN performs the language task at a finer granularity and naturally avoids the word segmentation issue. It empowers the language model to dynamically reason about word spelling as well as grammatical rules, resulting in expressive and elaborate sentences. We optimize the parameters of the neural networks by maximizing the probability of correctly generated character sequences. Quantitative and qualitative experiments on the popular MSCOCO and Flickr30k datasets show that the c-RNN describes images at considerably faster speed and with satisfactory quality.
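To make the decoding scheme concrete, the following is a minimal sketch of a character-level caption decoder in PyTorch. It is a hypothetical illustration under stated assumptions, not the authors' released code: the class name, layer sizes, the 70-symbol character vocabulary, and the single-layer LSTM are all assumptions. A CNN image feature conditions the first step of an LSTM, which then emits the caption one character per step; training maximizes the log-probability of the ground-truth character sequence, i.e. minimizes character-level cross-entropy.

import torch
import torch.nn as nn

class CharCaptionRNN(nn.Module):
    """Hypothetical character-level caption decoder (not the paper's exact model)."""
    def __init__(self, vocab_size=70, embed_dim=128, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # CNN feature becomes the step-0 input
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one embedding per character
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # logits over the character set

    def forward(self, img_feat, char_ids):
        # img_feat: (B, feat_dim); char_ids: (B, T) indices of caption characters
        img_tok = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, E)
        chars = self.embed(char_ids)                      # (B, T, E)
        x = torch.cat([img_tok, chars], dim=1)            # image conditions the whole sequence
        h, _ = self.lstm(x)
        return self.out(h)                                # (B, T+1, vocab_size)

# Training: maximize the probability of the correct character sequence,
# i.e. minimize cross-entropy between each step's prediction and the next character.
model = CharCaptionRNN()
feats = torch.randn(4, 2048)                  # dummy CNN image features
caps = torch.randint(0, 70, (4, 30))          # dummy character-encoded captions
logits = model(feats, caps[:, :-1])           # teacher forcing: feed all but the last character
loss = nn.functional.cross_entropy(
    logits[:, 1:].reshape(-1, 70),            # prediction from char t-1 is aligned with char t;
    caps[:, 1:].reshape(-1))                  # the image step's prediction of char 0 is dropped
loss.backward()

Because the output vocabulary is a few dozen symbols rather than tens of thousands of words, the softmax at each step is small, which is consistent with the abstract's claim of faster generation, at the cost of longer sequences per sentence.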







Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (61673402, 61273270, 60802069), the Natural Science Foundation of Guangdong Province (2017A030311029, 2016B010123005, 2017B090909005), the Science and Technology Program of Guangzhou of China (201704020180, 201604020024), and the Fundamental Research Funds for the Central Universities of China.

Author information


Corresponding author

Correspondence to Haifeng Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Huang, G., Hu, H. c-RNN: A Fine-Grained Language Model for Image Captioning. Neural Process Lett 49, 683–691 (2019). https://doi.org/10.1007/s11063-018-9836-2


