A novel deep fuzzy neural network semantic-enhanced method for automatic image captioning


Abstract

Neural image captioning (NIC) has recently been considered a fundamental problem in artificial intelligence (AI) that bridges computer vision (CV) and natural language processing (NLP). However, recent attribute-based and textual semantic attention-based NIC models still struggle to focus their attention mechanisms on the relevant relationships between the extracted visual features and the textual representations of an image's caption. Moreover, recent NIC models also suffer from uncertainty and noise in the latent visual features extracted from images, which sometimes prevents the captioning model from attending to the correct visual concepts. To address these challenges, in this paper we propose an end-to-end deep fuzzy neural network integrated with a unified attention-based, semantic-enhanced vision-language approach, called FuzzSemNIC. To alleviate noise and ambiguity in the extracted visual features, we apply a fused deep fuzzy neural network architecture to effectively learn and generate the visual representations of images. The learned fuzzy-based visual embedding vectors are then combined with selected attributes/concepts of the images via a recurrent neural network (RNN) to incorporate the fused latent visual features into the captioning task. Finally, the fused visual representations are integrated with a unified vision-language encoder–decoder that handles caption generation. Extensive experiments on benchmark image captioning datasets demonstrate the effectiveness of the proposed FuzzSemNIC model.
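
To make this pipeline concrete, below is a minimal, illustrative Python (PyTorch) sketch of a fuzzy-enhanced captioning model. It is not the paper's implementation: the Gaussian membership fuzzification, the single-LSTM decoder, and all module names and dimensions are assumptions standing in for the fused deep fuzzy network, feature-fusion, and encoder–decoder stages described above.

import torch
import torch.nn as nn

class GaussianFuzzyLayer(nn.Module):
    # Maps each visual feature through learnable Gaussian membership
    # functions, a common way to "fuzzify" inputs in fuzzy-neural hybrids.
    def __init__(self, in_dim, n_memberships):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(in_dim, n_memberships))
        self.log_sigmas = nn.Parameter(torch.zeros(in_dim, n_memberships))

    def forward(self, x):                     # x: (batch, in_dim)
        x = x.unsqueeze(-1)                   # (batch, in_dim, 1)
        sigmas = self.log_sigmas.exp()
        mu = torch.exp(-((x - self.centers) ** 2) / (2 * sigmas ** 2))
        return mu.flatten(1)                  # (batch, in_dim * n_memberships)

class FuzzyFusionCaptioner(nn.Module):
    # Fuses fuzzified and raw visual features into one embedding, then
    # decodes a caption with an LSTM conditioned on that embedding.
    def __init__(self, feat_dim=2048, n_memberships=3, embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.fuzzy = GaussianFuzzyLayer(feat_dim, n_memberships)
        self.fuse = nn.Linear(feat_dim * n_memberships + feat_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, captions):
        # visual_feats: (batch, feat_dim) pooled CNN features;
        # captions: (batch, seq_len) ground-truth token ids (teacher forcing).
        fused = torch.tanh(self.fuse(
            torch.cat([self.fuzzy(visual_feats), visual_feats], dim=1)))
        # Prepend the fused visual vector as the first "token" of the sequence.
        inputs = torch.cat([fused.unsqueeze(1), self.word_embed(captions)], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)               # per-step vocabulary logits

feats = torch.randn(2, 2048)                  # e.g. pooled ResNet features
caps = torch.randint(0, 10000, (2, 12))       # token ids of reference captions
logits = FuzzyFusionCaptioner()(feats, caps)  # shape: (2, 13, 10000)

The fuzzification step is the part intended to absorb noise in the visual features: each feature dimension is re-expressed as soft degrees of membership rather than a single crisp value, so small perturbations shift membership values smoothly instead of flipping discrete decisions.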


Data availability

No data are associated with the studies reported in this paper.

Notes

  1. CoreNLP library for NLP (Java): https://stanfordnlp.github.io/CoreNLP/.

  2. GloVe pre-trained word embedding model: https://nlp.stanford.edu/projects/glove/.
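
As a concrete illustration of note 2, the following short Python sketch loads pre-trained GloVe vectors into a lookup table. The file name glove.6B.300d.txt is an assumption; any file from the GloVe page above follows the same plain-text format, one token per line followed by its vector components.

import numpy as np

def load_glove(path):
    # Each line holds a token followed by its space-separated vector values.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

embeddings = load_glove("glove.6B.300d.txt")  # hypothetical local file
print(embeddings["caption"].shape)            # -> (300,)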


Acknowledgement

We would like to thank Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam, for supporting this study with time and facilities.

Funding

This research is funded by Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam.

Author information


Corresponding author

Correspondence to Tham Vo.

Ethics declarations

Conflict of interest

Dr. Tham Vo has received a research grant from Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam. She has no other conflicts of interest or financial ties to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Vo, T. A novel deep fuzzy neural network semantic-enhanced method for automatic image captioning. Soft Comput 27, 14647–14658 (2023). https://doi.org/10.1007/s00500-023-09100-0
