Abstract
Image captioning is a high-level task in image understanding, in which most models adopt a convolutional neural network (CNN) to extract image features and a recurrent neural network (RNN) to generate sentences. In recent years, researchers have tended to design complex networks with deeper layers to improve feature extraction. Increasing the size of the network yields higher-quality features, but it is inefficient in terms of computational cost: the large number of parameters a deep CNN brings makes such models difficult to deploy in everyday applications. To reduce the information loss of the convolutional process at a lower cost, we propose a lightweight convolutional neural network, named Bifurcate-CNN (B-CNN). Furthermore, whereas most recent work is devoted to generating captions in English, in this paper we develop an image caption model that generates descriptions in Chinese. Compared with Inception-v3, our model is shallower, has fewer parameters, and has a lower computational cost. Evaluated on the AI CHALLENGER dataset, we show that our model enhances performance, improving BLEU-4 from 46.1 to 49.9 and CIDEr from 142.5 to 156.6.
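The CNN-encoder/RNN-decoder pipeline described in the abstract can be sketched at a toy scale. The following is a minimal illustration only, with made-up sizes, random stand-in weights, a five-word vocabulary, and invented function names; it does not reproduce the paper's B-CNN encoder, only the generic encode-then-greedy-decode structure such models share.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<eos>", "a", "man", "rides", "horse"]   # toy vocabulary
FEAT, HID, V = 8, 8, len(VOCAB)

def cnn_encode(image):
    """Stand-in for the CNN encoder: global-average-pool a fake
    convolutional feature map down to one image feature vector."""
    fmap = image.reshape(FEAT, -1)        # (channels, spatial positions)
    return fmap.mean(axis=1)              # (FEAT,)

# Decoder parameters (random placeholders for trained weights).
W_ih = rng.normal(scale=0.1, size=(HID, FEAT))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(HID, HID))    # hidden -> hidden
W_ho = rng.normal(scale=0.1, size=(V, HID))      # hidden -> vocab logits
E    = rng.normal(scale=0.1, size=(FEAT, V))     # token embeddings

def rnn_decode(feat, max_len=6):
    """Greedy decoding: feed the image feature at the first step, then
    the embedding of the previously emitted token, until <eos>."""
    h, x, words = np.zeros(HID), feat, []
    for _ in range(max_len):
        h = np.tanh(W_ih @ x + W_hh @ h)  # simple (non-LSTM) RNN cell
        tok = int(np.argmax(W_ho @ h))    # greedy: most likely token
        if VOCAB[tok] == "<eos>":
            break
        words.append(VOCAB[tok])
        x = E[:, tok]                     # next input: token embedding
    return words

image = rng.normal(size=(FEAT, 4, 4))     # fake 8-channel 4x4 feature map
caption = rnn_decode(cnn_encode(image))
print(caption)
```

With trained weights, the decoder's argmax at each step would trace out a fluent sentence; a real system would also use an LSTM cell and beam search rather than the plain RNN and greedy choice shown here.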
This work has been supported by the National Natural Science Foundation of China (No. 61571328).
Zhao, D., Yang, R. & Guo, S. A lightweight convolutional neural network for large-scale Chinese image caption. Optoelectron. Lett. 17, 361–366 (2021). https://doi.org/10.1007/s11801-021-0100-z