Abstract
Prior captioning methods based on the conventional deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architecture follow the machine-translation paradigm with word-level modelling. Word-level modelling, however, requires an optimal word segmentation algorithm to split each sentence into words, which is a very difficult task. In this paper, we build a character-level RNN (c-RNN) that models captions directly as character sequences, so that a descriptive sentence is composed in a flow of characters. The c-RNN performs the language task at a finer granularity and naturally avoids the word segmentation issue. It also empowers the language model to reason dynamically about word spelling as well as grammatical rules, which results in expressive and elaborate sentences. We optimize the network parameters by maximizing the probability of correctly generated characterized sentences. Quantitative and qualitative experiments on the popular MSCOCO and Flickr30k datasets show that our c-RNN can describe images at a considerably faster speed and with satisfactory quality.
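To make the character-level idea concrete, the sketch below (an illustration, not the paper's implementation; all names are hypothetical) shows the two properties the abstract relies on: the vocabulary is just the set of characters, so no word segmenter is needed, and the training objective is the sum of per-character log-probabilities of the correct caption.

```python
import math

def build_char_vocab(captions):
    """Map every distinct character (plus an end-of-sentence token) to an index.
    Unlike a word vocabulary, this needs no segmentation and stays tiny."""
    chars = sorted({ch for cap in captions for ch in cap})
    vocab = {ch: i for i, ch in enumerate(chars)}
    vocab["<eos>"] = len(vocab)  # sentence terminator
    return vocab

def caption_log_likelihood(caption, char_log_prob):
    """Training objective sketch: sum of log-probabilities of each correct
    character given its prefix. `char_log_prob(prefix, ch)` stands in for
    the c-RNN's per-step prediction conditioned on the image and history."""
    total = 0.0
    for t, ch in enumerate(caption):
        total += char_log_prob(caption[:t], ch)
    return total

captions = ["a dog runs", "a cat sits"]
vocab = build_char_vocab(captions)
print(len(vocab))  # a handful of symbols, versus thousands of words

# A uniform stub model, just to exercise the objective:
uniform = lambda prefix, ch: math.log(1.0 / len(vocab))
print(caption_log_likelihood("a dog", uniform))
```

Maximizing this log-likelihood over the training captions is the character-level analogue of the usual word-level objective; the per-step distribution in a real model would come from the RNN's softmax over the character vocabulary.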
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (61673402, 61273270, 60802069), the Natural Science Foundation of Guangdong Province (2017A030311029, 2016B010123005, 2017B090909005), the Science and Technology Program of Guangzhou of China (201704020180, 201604020024), and the Fundamental Research Funds for the Central Universities of China.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Huang, G., Hu, H. c-RNN: A Fine-Grained Language Model for Image Captioning. Neural Process Lett 49, 683–691 (2019). https://doi.org/10.1007/s11063-018-9836-2