Abstract
Online video typically contains a variety of information that a video captioning model can exploit to generate descriptions. Caption generation proceeds in two steps: video information extraction and natural language generation. Existing models suffer from redundant information across consecutive frames during language generation, which hurts caption accuracy. This paper therefore proposes VMSG, a video captioning model based on multimodal semantic grouping and semantic attention. Unlike decoding frame by frame, VMSG gathers frames that share the same semantics into a semantic group and decodes over these groups when predicting the next word, which reduces the redundancy of consecutive video frames. Because semantic groups differ in importance, we further introduce a semantic attention mechanism that assigns a weight to each group, and we use a single-layer LSTM to keep the model simple. Experiments show that VMSG outperforms several state-of-the-art models in caption generation performance and alleviates the redundancy problem of consecutive video frames.
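The grouping-then-attention idea described above can be illustrated with a minimal numpy sketch: consecutive frame features are greedily merged into semantic groups by cosine similarity, and a softmax attention over the pooled group vectors produces the context used at each decoding step. The similarity threshold, mean pooling, and dot-product scoring here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_frames(frame_feats, sim_threshold=0.9):
    """Greedily merge consecutive frames whose cosine similarity to the
    current group's mean exceeds the threshold; each group is mean-pooled
    into a single semantic-group vector."""
    groups, current = [], [frame_feats[0]]
    for f in frame_feats[1:]:
        ref = np.mean(current, axis=0)
        cos = f @ ref / (np.linalg.norm(f) * np.linalg.norm(ref) + 1e-8)
        if cos >= sim_threshold:
            current.append(f)          # same semantics: extend the group
        else:
            groups.append(np.mean(current, axis=0))
            current = [f]              # semantics changed: start a new group
    groups.append(np.mean(current, axis=0))
    return np.stack(groups)

def semantic_attention(groups, query):
    """Softmax attention over semantic groups, conditioned on the decoder
    state `query`; returns the context vector and the group weights."""
    scores = groups @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ groups, w

# Toy example: 3 frames of one scene followed by 2 frames of another
frames = np.array([[1, 0, 0, 0]] * 3 + [[0, 1, 0, 0]] * 2, dtype=float)
groups = group_frames(frames)          # 5 frames collapse to 2 groups
ctx, weights = semantic_attention(groups, query=np.array([1.0, 0, 0, 0]))
```

In a full model the context vector `ctx` would be concatenated with the previous word embedding and fed to the single-layer LSTM decoder at each step; here it simply shows how attention redistributes weight across groups instead of across redundant individual frames.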
Data availability
The data that support the findings of this study are available from the authors upon reasonable request.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (61573182) and by the Fundamental Research Funds for the Central Universities (NS2020025).
Funding
This work was funded by the National Natural Science Foundation of China (Grants 61573182 and 62073164) and the Fundamental Research Funds for the Central Universities (NS2020025).
Author information
Authors and Affiliations
Contributions
Xin Yang contributed to the conception of the study, and wrote the manuscript; Xiangchen Wang performed the experiment; Xiaohui Ye contributed significantly to analysis and manuscript preparation, and performed the data analyses; Tao Li helped perform the analysis with constructive discussions.
Corresponding author
Ethics declarations
Conflict of interest
Xin Yang declares that he has no conflict of interest. Xiangchen Wang declares that he has no conflict of interest. Xiaohui Ye declares that he has no conflict of interest.
Additional information
Communicated by S. Vrochidis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, X., Wang, X., Ye, X. et al. VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29, 2575–2589 (2023). https://doi.org/10.1007/s00530-023-01124-8