A novel deep fuzzy neural network semantic-enhanced method for automatic image captioning


Abstract

Neural image captioning (NIC) has recently been considered a fundamental problem in artificial intelligence (AI) that bridges computer vision (CV) and natural language processing (NLP). However, recent attribute-based and textual semantic attention-based NIC models still struggle to focus their attention mechanisms on the relevant relationships between the extracted visual features and the textual representations of an image's caption. Moreover, recent NIC models also suffer from uncertainty and noise in the latent visual features extracted from images, which sometimes prevents the captioning model from attending to the correct visual concepts. To address these challenges, in this paper we propose an end-to-end deep fuzzy neural network integrated with a unified attention-based, semantic-enhanced vision-language approach, called FuzzSemNIC. To alleviate noise and ambiguity in the extracted visual features, we apply a fused deep fuzzy neural network architecture to effectively learn and generate the visual representations of images. The learned fuzzy-based visual embedding vectors are then combined with selected attributes/concepts of the images via a recurrent neural network (RNN) to incorporate the fused latent visual features into the captioning task. Finally, the fused visual representations are integrated with a unified vision-language encoder–decoder that handles caption generation. Extensive experiments on benchmark image captioning datasets demonstrate the effectiveness of the proposed FuzzSemNIC model.
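
To make this pipeline concrete, below is a minimal, illustrative Python (PyTorch) sketch of a fuzzy-enhanced captioning model. It is not the paper's implementation: the Gaussian membership fuzzification, the single-LSTM decoder, and all module names and dimensions are assumptions standing in for the fused deep fuzzy network, feature-fusion, and encoder–decoder stages described above.

import torch
import torch.nn as nn

class GaussianFuzzyLayer(nn.Module):
    # Maps each visual feature through learnable Gaussian membership
    # functions, a common way to "fuzzify" inputs in fuzzy-neural hybrids.
    def __init__(self, in_dim, n_memberships):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(in_dim, n_memberships))
        self.log_sigmas = nn.Parameter(torch.zeros(in_dim, n_memberships))

    def forward(self, x):                     # x: (batch, in_dim)
        x = x.unsqueeze(-1)                   # (batch, in_dim, 1)
        sigmas = self.log_sigmas.exp()
        mu = torch.exp(-((x - self.centers) ** 2) / (2 * sigmas ** 2))
        return mu.flatten(1)                  # (batch, in_dim * n_memberships)

class FuzzyFusionCaptioner(nn.Module):
    # Fuses fuzzified and raw visual features into one embedding, then
    # decodes a caption with an LSTM conditioned on that embedding.
    def __init__(self, feat_dim=2048, n_memberships=3, embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.fuzzy = GaussianFuzzyLayer(feat_dim, n_memberships)
        self.fuse = nn.Linear(feat_dim * n_memberships + feat_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, captions):
        # visual_feats: (batch, feat_dim) pooled CNN features;
        # captions: (batch, seq_len) ground-truth token ids (teacher forcing).
        fused = torch.tanh(self.fuse(
            torch.cat([self.fuzzy(visual_feats), visual_feats], dim=1)))
        # Prepend the fused visual vector as the first "token" of the sequence.
        inputs = torch.cat([fused.unsqueeze(1), self.word_embed(captions)], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)               # per-step vocabulary logits

feats = torch.randn(2, 2048)                  # e.g. pooled ResNet features
caps = torch.randint(0, 10000, (2, 12))       # token ids of reference captions
logits = FuzzyFusionCaptioner()(feats, caps)  # shape: (2, 13, 10000)

The fuzzification step is the part intended to absorb noise in the visual features: each feature dimension is re-expressed as soft degrees of membership rather than a single crisp value, so small perturbations shift membership values smoothly instead of flipping discrete decisions.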


Data availability

No data are associated with the studies reported in this paper.

Notes

  1. CoreNLP library for NLP (Java): https://stanfordnlp.github.io/CoreNLP/.

  2. GloVe pre-trained word embedding model: https://nlp.stanford.edu/projects/glove/.
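
As a concrete illustration of note 2, the following short Python sketch loads pre-trained GloVe vectors into a lookup table. The file name glove.6B.300d.txt is an assumption; any file from the GloVe page above follows the same plain-text format, one token per line followed by its vector components.

import numpy as np

def load_glove(path):
    # Each line holds a token followed by its space-separated vector values.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

embeddings = load_glove("glove.6B.300d.txt")  # hypothetical local file
print(embeddings["caption"].shape)            # -> (300,)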


Acknowledgement

We would like to thank Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam, for supporting this study with time and facilities.

Funding

This research is funded by Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam.

Author information


Corresponding author

Correspondence to Tham Vo.

Ethics declarations

Conflict of interest

Dr. Tham Vo has received a research grant from Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam. She has no other conflicts of interest or financial ties to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Vo, T. A novel deep fuzzy neural network semantic-enhanced method for automatic image captioning. Soft Comput 27, 14647–14658 (2023). https://doi.org/10.1007/s00500-023-09100-0
