Abstract
With the rapid growth of video data, video summarization is a promising approach for condensing a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotations, which are time-consuming and tedious to produce. In this article, we propose a novel deep summarization framework, the Deep Semantic and Attentive Network for Video Summarization (DSAVS), which selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. A second challenge in video summarization is modeling temporal information over long time spans. Long Short-Term Memory (LSTM) networks model temporal dependencies well but do not work well with long video clips. We therefore introduce a self-attention mechanism into our summarization framework to capture long-range temporal dependencies among frames. Extensive experiments on two popular benchmark datasets, SumMe and TVSum, show that our proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
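The two ideas named in the abstract, self-attention across frames and a distance between the video and text representations, can be illustrated with a small sketch. It is not the DSAVS architecture, which the abstract does not specify: the identity query/key/value projections, the score-weighted mean used as the video representation, the Euclidean distance, and all function names and toy vectors are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention over frame features.

    Assumption: queries, keys, and values are the raw features
    (identity projections); a trained model would learn them.
    """
    d = len(frames[0])
    attended = []
    for q in frames:
        # Every frame attends to every other frame, regardless of distance in time.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return attended

def weighted_mean(frames, scores):
    """Hypothetical video representation: importance-weighted mean of frames."""
    total = sum(scores)
    d = len(frames[0])
    return [sum(s * f[j] for s, f in zip(scores, frames)) / total for j in range(d)]

def semantic_distance(video_vec, text_vec):
    """Euclidean distance between the two representations (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(video_vec, text_vec)))

# Toy data: three 2-D frame features and one text embedding in the same space.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
text_vec = [1.0, 0.0]

attended = self_attention(frames)
uniform_scores = [1.0, 1.0, 1.0]   # every frame equally important
focused_scores = [0.9, 0.1, 0.9]   # emphasize frames that resemble the text

d_uniform = semantic_distance(weighted_mean(attended, uniform_scores), text_vec)
d_focused = semantic_distance(weighted_mean(attended, focused_scores), text_vec)
print(d_uniform > d_focused)
```

In this toy setting, frame scores that favor the frames closest to the text embedding give a smaller distance than uniform scores. That is the kind of signal a label-free objective can use in place of frame-level annotations.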
Index Terms
- Deep Semantic and Attentive Network for Unsupervised Video Summarization