
VMSG: a video caption network based on multimodal semantic grouping and semantic attention

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Network video typically contains a variety of information that video caption models use to generate video tags. The process of creating video captions is divided into two steps: video information extraction and natural language generation. Existing models suffer from redundant information across consecutive frames when generating natural language, which reduces caption accuracy. To address this, this paper proposes VMSG, a video caption model based on multimodal semantic grouping and semantic attention. Unlike decoders that group by frame, VMSG uses a novel semantic grouping method for decoding: frames that share the same semantics are merged into a semantic group, and the next word is predicted from these groups, which reduces the redundant information of consecutive video frames. Because the importance of each semantic group varies, we investigate a semantic attention mechanism that weights the semantic groups, and we use a single-layer LSTM to simplify the model. Experiments show that VMSG outperforms some state-of-the-art models in caption generation performance and alleviates the problem of redundant information in consecutive video frames.
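
To make the grouping-then-attention decoding concrete, here is a minimal sketch of the general idea. It is illustrative only, not VMSG's implementation: the grouping rule (merging consecutive frames whose features are cosine-similar above a threshold), the mean-pooled group summaries, and every module name, dimension, and threshold below are assumptions rather than details taken from the paper.

```python
# Minimal sketch of semantic-group decoding with group-level attention and a
# single-layer LSTM decoder. Illustrative assumptions throughout; not the
# authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def group_frames(frame_feats: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedily merge consecutive frames whose features are similar.

    frame_feats: (T, D) per-frame features. Returns (G, D) group summaries,
    one per semantic group (mean of its member frames).
    """
    groups, current = [], [frame_feats[0]]
    for t in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[t], current[-1], dim=0)
        if sim >= threshold:                 # same semantics -> same group
            current.append(frame_feats[t])
        else:                                # semantics changed -> start a new group
            groups.append(torch.stack(current).mean(dim=0))
            current = [frame_feats[t]]
    groups.append(torch.stack(current).mean(dim=0))
    return torch.stack(groups)


class GroupAttentionDecoder(nn.Module):
    """Single-layer LSTM decoder attending over semantic-group summaries."""

    def __init__(self, feat_dim=512, embed_dim=300, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_id, state, group_feats):
        h, c = state
        # Semantic attention: score each group against the current decoder state.
        scores = self.attn(torch.cat(
            [h.expand(group_feats.size(0), -1), group_feats], dim=1))   # (G, 1)
        weights = torch.softmax(scores, dim=0)                          # (G, 1)
        context = (weights * group_feats).sum(dim=0, keepdim=True)      # (1, D)
        # One LSTM step over the previous word embedding plus the attended context.
        h, c = self.lstm(torch.cat([self.embed(word_id), context], dim=1), (h, c))
        return self.out(h), (h, c)                                      # word logits, new state


# Toy usage: 20 frames of 512-d features -> semantic groups -> one decoding step.
frames = torch.randn(20, 512)
groups = group_frames(frames)
decoder = GroupAttentionDecoder()
h = c = torch.zeros(1, 512)
logits, (h, c) = decoder.step(torch.tensor([1]), (h, c), groups)
```

The toy usage only shows the data flow (frame features to semantic groups to one attention-weighted LSTM decoding step); in practice the predicted word would be fed back at each step and the whole pipeline trained end to end.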

Data availability

The data that support the findings of this study are available from the authors upon reasonable request.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (61573182) and by the Fundamental Research Funds for the Central Universities (NS2020025).

Funding

National Natural Science Foundation of China (61573182, 62073164); Fundamental Research Funds for the Central Universities (NS2020025).

Author information

Contributions

Xin Yang contributed to the conception of the study and wrote the manuscript; Xiangchen Wang performed the experiments; Xiaohui Ye contributed significantly to the analysis and manuscript preparation and performed the data analyses; Tao Li helped perform the analysis with constructive discussions.

Corresponding author

Correspondence to Xin Yang.

Ethics declarations

Conflict of interest

Xin Yang declares that he has no conflict of interest. Xiangchen Wang declares that he has no conflict of interest. Xiaohui Ye declares that he has no conflict of interest.

Additional information

Communicated by S. Vrochidis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, X., Wang, X., Ye, X. et al. VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29, 2575–2589 (2023). https://doi.org/10.1007/s00530-023-01124-8
