ABSTRACT
Image captioning uses artificial intelligence to translate visual content into natural-language text descriptions. Underwater image captioning supports specialized interpretation in scenarios such as underwater environmental monitoring, underwater archaeology, and offshore platforms, and it effectively compresses information for the real-time transmission of large volumes of underwater images over underwater acoustic communication links. In this article, we annotate an underwater image caption dataset for this task and build a baseline with an encoder-decoder neural image caption model that outputs complete sentences describing image content. Descriptions of underwater images focus mainly on the underwater scene and its objects, so an object detection model based on Faster R-CNN is applied to extract full-image features and regional features corresponding to the targets in the image. For the caption model, we enrich the input to the language generator by fusing global information, regional details, contextual cues, and previously generated text, enabling the generator to produce precise semantic expressions for salient objects. Applied to the annotated underwater image caption dataset, the method yields more accurate descriptions of underwater targets than sentences generated by a basic neural caption model, and higher scores on the evaluation metrics confirm the effectiveness of our approach.
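To make the feature-fusion idea concrete, the PyTorch sketch below shows one plausible way to combine a pooled full-image feature, an attention-weighted Faster R-CNN region feature (the contextual cue comes from conditioning attention on the decoder's hidden state), and the embedding of the previous word as the input to an LSTM language generator. This is an illustration only, not the authors' released code: the class name `FusionCaptionDecoder`, the feature dimensions, the single-layer `LSTMCell`, and the additive-style attention are all assumptions.

```python
import torch
import torch.nn as nn

class FusionCaptionDecoder(nn.Module):
    """Illustrative fusion-based caption decoder (not the paper's code).
    At each step, the LSTM input concatenates the global image feature,
    an attention-weighted region feature, and the previous word embedding."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # attention score over regions, conditioned on the hidden state
        self.att = nn.Linear(feat_dim + hidden_dim, 1)
        self.lstm = nn.LSTMCell(feat_dim * 2 + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat, region_feats, captions):
        # global_feat:  (B, feat_dim)    pooled full-image feature
        # region_feats: (B, R, feat_dim) detector region features
        # captions:     (B, T)           ground-truth token ids (teacher forcing)
        B, T = captions.shape
        h = global_feat.new_zeros(B, self.lstm.hidden_size)
        c = global_feat.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # contextual cue: attend over regions given the current hidden state
            h_exp = h.unsqueeze(1).expand(-1, region_feats.size(1), -1)
            scores = self.att(torch.cat([region_feats, h_exp], dim=-1)).squeeze(-1)
            weights = scores.softmax(dim=-1)                       # (B, R)
            region_ctx = (weights.unsqueeze(-1) * region_feats).sum(dim=1)
            # fusion: global feature + attended region feature + previous word
            word = self.embed(captions[:, t])
            x = torch.cat([global_feat, region_ctx, word], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                          # (B, T, vocab)

# Toy forward pass with random features
dec = FusionCaptionDecoder(vocab_size=1000)
g = torch.randn(2, 2048)                 # 2 images
r = torch.randn(2, 36, 2048)             # 36 regions per image
caps = torch.randint(0, 1000, (2, 12))   # 12-token captions
print(dec(g, r, caps).shape)             # torch.Size([2, 12, 1000])
```

Concatenating the fused features at every decoding step, rather than only at initialization, is one simple design that keeps both global scene context and salient-object detail available to the generator throughout the sentence.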