Deep Semantic and Attentive Network for Unsupervised Video Summarization

Published: 16 February 2022

Abstract

With the rapid growth of video data, video summarization is a promising approach for shortening a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotated labels, and such annotation is time-consuming and tedious. In this article, we propose a novel deep summarization framework named Deep Semantic and Attentive Network for Video Summarization (DSAVS) that selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. Another challenge in video summarization lies in modeling temporal information over long time spans. Long Short-Term Memory (LSTM) models temporal dependencies well but does not cope well with long video clips. We therefore introduce a self-attention mechanism into our summarization framework to capture long-range temporal dependencies among frames. Extensive experiments on two popular benchmark datasets, i.e., SumMe and TVSum, show that our proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
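To make the two ideas in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of such a pipeline: a self-attention layer contextualizes per-frame CNN features so that long-range dependencies are captured, per-frame importance scores pool a video-level embedding, and a semantic loss minimizes the cosine distance between that embedding and a text embedding in a joint space. The class names, layer sizes, and feature dimensions (FrameSelfAttention, DSAVSSketch, frame_dim=1024, text_dim=2400, joint_dim=512) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameSelfAttention(nn.Module):
    """Scaled dot-product self-attention over a sequence of frame features,
    so every frame can attend to every other frame regardless of distance."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.query = nn.Linear(dim, hidden)
        self.key = nn.Linear(dim, hidden)
        self.value = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)
        self.scale = hidden ** 0.5

    def forward(self, frames):                      # frames: (T, dim)
        q, k, v = self.query(frames), self.key(frames), self.value(frames)
        attn = torch.softmax(q @ k.t() / self.scale, dim=-1)   # (T, T) attention weights
        return self.out(attn @ v) + frames          # residual connection


class DSAVSSketch(nn.Module):
    """Hypothetical sketch: score frames with self-attention, pool a video-level
    embedding from the scores, and align it with a text embedding in a joint space."""

    def __init__(self, frame_dim=1024, text_dim=2400, joint_dim=512):
        super().__init__()
        self.attn = FrameSelfAttention(frame_dim)
        self.scorer = nn.Sequential(nn.Linear(frame_dim, 1), nn.Sigmoid())
        self.video_proj = nn.Linear(frame_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, frame_feats, text_emb):
        ctx = self.attn(frame_feats)                 # contextualized frame features
        scores = self.scorer(ctx).squeeze(-1)        # importance score per frame, in (0, 1)
        video_emb = (scores.unsqueeze(-1) * ctx).sum(dim=0) / (scores.sum() + 1e-8)
        v = F.normalize(self.video_proj(video_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        semantic_loss = 1.0 - (v * t).sum()          # cosine distance in the joint space
        return scores, semantic_loss


# Toy usage with random tensors standing in for CNN frame features and a
# sentence embedding of the video description (both dimensions are assumed).
frame_feats = torch.randn(300, 1024)   # 300 sampled frames, 1024-d features
text_emb = torch.randn(2400)           # 2400-d sentence embedding
model = DSAVSSketch()
scores, loss = model(frame_feats, text_emb)
loss.backward()                        # trainable without frame-level labels
```

Note that training in this sketch requires only the frame features and a description embedding, which mirrors the label-free setting described in the abstract; the actual DSAVS objective and architecture may differ.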



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
May 2022, 494 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505207

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 February 2022
      • Accepted: 1 July 2021
      • Revised: 1 May 2021
      • Received: 1 November 2020
Published in TOMM Volume 18, Issue 2


      Qualifiers

      • research-article
      • Refereed
