Abstract
Online video typically contains a variety of information that a video captioning model can exploit to generate descriptions. Caption generation proceeds in two steps: video information extraction and natural language generation. Existing models suffer from redundant information across consecutive frames during language generation, which hurts caption accuracy. This paper therefore proposes VMSG, a video captioning model based on multimodal semantic grouping and semantic attention. Unlike decoding frame by frame, VMSG gathers frames that share the same semantics into a semantic group and decodes over these groups when predicting the next word, which reduces the redundancy of consecutive video frames. Because semantic groups differ in importance, we further introduce a semantic attention mechanism that assigns a weight to each group, and we use a single-layer LSTM to keep the model simple. Experiments show that VMSG outperforms several state-of-the-art models in caption generation performance and alleviates the redundancy problem of consecutive video frames.
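The grouping-then-attention idea described above can be illustrated with a minimal numpy sketch: consecutive frame features are greedily merged into semantic groups by cosine similarity, and a softmax attention over the pooled group vectors produces the context used at each decoding step. The similarity threshold, mean pooling, and dot-product scoring here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_frames(frame_feats, sim_threshold=0.9):
    """Greedily merge consecutive frames whose cosine similarity to the
    current group's mean exceeds the threshold; each group is mean-pooled
    into a single semantic-group vector."""
    groups, current = [], [frame_feats[0]]
    for f in frame_feats[1:]:
        ref = np.mean(current, axis=0)
        cos = f @ ref / (np.linalg.norm(f) * np.linalg.norm(ref) + 1e-8)
        if cos >= sim_threshold:
            current.append(f)          # same semantics: extend the group
        else:
            groups.append(np.mean(current, axis=0))
            current = [f]              # semantics changed: start a new group
    groups.append(np.mean(current, axis=0))
    return np.stack(groups)

def semantic_attention(groups, query):
    """Softmax attention over semantic groups, conditioned on the decoder
    state `query`; returns the context vector and the group weights."""
    scores = groups @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ groups, w

# Toy example: 3 frames of one scene followed by 2 frames of another
frames = np.array([[1, 0, 0, 0]] * 3 + [[0, 1, 0, 0]] * 2, dtype=float)
groups = group_frames(frames)          # 5 frames collapse to 2 groups
ctx, weights = semantic_attention(groups, query=np.array([1.0, 0, 0, 0]))
```

In a full model the context vector `ctx` would be concatenated with the previous word embedding and fed to the single-layer LSTM decoder at each step; here it simply shows how attention redistributes weight across groups instead of across redundant individual frames.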
Data availability
The data that support the findings of this study are available from the authors upon reasonable request.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (61573182) and by the Fundamental Research Funds for the Central Universities (NS2020025).
Funding
This work was funded by the National Natural Science Foundation of China (Grants 61573182 and 62073164) and the Fundamental Research Funds for the Central Universities (NS2020025).
Author information
Authors and Affiliations
Contributions
Xin Yang contributed to the conception of the study, and wrote the manuscript; Xiangchen Wang performed the experiment; Xiaohui Ye contributed significantly to analysis and manuscript preparation, and performed the data analyses; Tao Li helped perform the analysis with constructive discussions.
Corresponding author
Ethics declarations
Conflict of interest
Xin Yang declares that he has no conflict of interest. Xiangchen Wang declares that he has no conflict of interest. Xiaohui Ye declares that he has no conflict of interest.
Additional information
Communicated by S. Vrochidis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, X., Wang, X., Ye, X. et al. VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29, 2575–2589 (2023). https://doi.org/10.1007/s00530-023-01124-8