Scaling Up Video Summarization Pretraining with Large Language Models

Argaw, Dawit Mureja; Yoon, Seunghyun; Heilbron, Fabian Caba; Deilamsalehy, Hanieh; Bui, Trung; Wang, Zhaowen; Dernoncourt, Franck; Chung, Joon Son

arXiv:2404.03398 (cs)

[Submitted on 4 Apr 2024]

Abstract: Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in size, which constrains how well state-of-the-art methods generalize. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1,200 long videos, each with high-quality summaries annotated by professionals. Extensive experiments indicate that our proposed approach sets a new state of the art in video summarization across several benchmarks.

Comments:	Accepted to CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.03398 [cs.CV]
	(or arXiv:2404.03398v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.03398

Submission history

From: Dawit Mureja Argaw
[v1] Thu, 4 Apr 2024 11:59:06 UTC (636 KB)
