research-article

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Authors:
Sheng-Hua Zhong
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Jingxu Lin
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Jianglin Lu
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Ahmed Fares
College of Computer Science and Software Engineering, Shenzhen University, China; and Department of Electrical Engineering, the Computer Systems Engineering Program, Faculty of Engineering at Shoubra, Benha University, Cairo, Egypt

Tongwei Ren
State Key Laboratory for Novel Software Technology, Nanjing University, China

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2, Article No. 55, pp. 1–21. https://doi.org/10.1145/3477538

Published: 16 February 2022


Abstract

With the rapid growth of video data, video summarization is a promising approach to shortening a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotated labels, and producing such annotations is time-consuming and tedious. In this article, we propose a novel deep summarization framework named Deep Semantic and Attentive Network for Video Summarization (DSAVS) that selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. A second challenge in video summarization is the difficulty of modeling temporal information over long time spans. Long Short-Term Memory (LSTM) networks model temporal dependencies well but do not work well with long video clips. Therefore, we introduce a self-attention mechanism into our summarization framework to capture the long-range temporal dependencies among the frames. Extensive experiments on two popular benchmark datasets, SumMe and TVSum, show that our proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
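The two ideas named in the abstract, self-attention over frames and a distance between video and text representations, can be illustrated with a minimal sketch. This is not the DSAVS architecture: the identity query/key/value projections, the importance-weighted mean pooling, the squared Euclidean distance, the function names, and the toy vectors are all assumptions chosen only to keep the example small and runnable.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections (assumed).

    Each frame attends to every other frame directly, so the link between
    two frames does not depend on how far apart they are in time.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    attended = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        weights = softmax(scores)
        attended.append([
            sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)
        ])
    return attended


def weighted_video_embedding(frames, importance):
    """Importance-weighted mean of frame features (hypothetical pooling)."""
    total = sum(importance)
    dim = len(frames[0])
    return [
        sum(p * frame[d] for p, frame in zip(importance, frames)) / total
        for d in range(dim)
    ]


def semantic_distance(video_vec, text_vec):
    """Squared Euclidean distance between the two embeddings (assumed)."""
    return sum((v - t) ** 2 for v, t in zip(video_vec, text_vec))


# Toy data (made up): four 3-dimensional frame features and one text embedding.
frames = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embedding = [1.0, 0.0, 0.0]

attended = self_attention(frames)

# Compare uniform frame importance with importance focused on frames 0 and 1.
uniform_loss = semantic_distance(
    weighted_video_embedding(attended, [0.25, 0.25, 0.25, 0.25]),
    text_embedding)
focused_loss = semantic_distance(
    weighted_video_embedding(attended, [0.45, 0.45, 0.05, 0.05]),
    text_embedding)

print("uniform importance loss:", uniform_loss)
print("focused importance loss:", focused_loss)
```

In this toy setting, putting more importance on the frames that resemble the text embedding lowers the distance, which is the kind of signal an unsupervised objective could use in place of frame-level labels.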

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision tasks → Video summarization

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 18, Issue 2
May 2022
494 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505207
Editor: Alberto Del Bimbo, University of Firenze, Italy
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 February 2022
- Accepted: 1 July 2021
- Revised: 1 May 2021
- Received: 1 November 2020

Author Tags
Video summarization
visual-semantic embedding
self-attention
Qualifiers
- research-article
- Refereed

Article Metrics
- Total Citations: 8
- Total Downloads: 1,034
- Downloads (Last 12 months): 289
- Downloads (Last 6 weeks): 20