Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

Ning, Munan; Zhu, Bin; Xie, Yujia; Lin, Bin; Cui, Jiaxi; Yuan, Lu; Chen, Dongdong; Yuan, Li

arXiv:2311.16103 (cs)

[Submitted on 27 Nov 2023 (v1), last revised 28 Nov 2023 (this version, v2)]

Abstract: Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for the users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and generating convenient final scores. We evaluate 8 representative Video-LLMs using Video-Bench. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions. The benchmark and toolkit are available at: this https URL.

Comments:	Benchmark is available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2311.16103 [cs.CV]
	(or arXiv:2311.16103v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.16103

Submission history

From: Dongdong Chen
[v1] Mon, 27 Nov 2023 18:59:58 UTC (7,213 KB)
[v2] Tue, 28 Nov 2023 18:16:29 UTC (7,365 KB)
