An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Ma, Ziyang; Yang, Guanrou; Yang, Yifan; Gao, Zhifu; Wang, Jiaming; Du, Zhihao; Yu, Fan; Chen, Qian; Zheng, Siqi; Zhang, Shiliang; Chen, Xie

arXiv:2402.08846 (cs)

[Submitted on 13 Feb 2024]

Abstract: In this paper, we focus on solving one of the most important tasks in speech processing, automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Recent works rely on complex designs such as temporally compressing the speech encoder output, tackling modal alignment in the projector, and applying parameter-efficient fine-tuning to the LLM. We find that such delicate designs are not necessary: an embarrassingly simple composition of an off-the-shelf speech encoder, an off-the-shelf LLM, and a linear projector as the only trainable module is competent for the ASR task. More specifically, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup with little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the LibriSpeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive paired data. Finally, we explore the emergence of capability in LLM-based ASR during the process of modal alignment. We hope that our study can facilitate research on extending LLMs with cross-modal capacity and offer insights to the LLM-based ASR community.
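
The composition described in the abstract (a frozen speech encoder, a frozen LLM, and a single trainable linear projector between them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dimensions, and the choice to place the projected speech frames before the prompt embeddings are all assumptions made for illustration, and the encoder and LLM are stood in for by plain lists of vectors.

```python
import random

# Hypothetical toy sizes; a real encoder and LLM use far larger dimensions.
ENCODER_DIM = 4   # assumed size of each speech-encoder output frame
LLM_DIM = 6       # assumed size of each LLM input embedding


def make_linear(in_dim, out_dim, seed=0):
    """Create the projector's weight matrix and bias (the only trainable parameters)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(frames, weight, bias):
    """Apply y = W x + b to every encoder frame, mapping ENCODER_DIM to LLM_DIM."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_llm_input(speech_frames, prompt_embeddings, weight, bias):
    """Concatenate projected speech frames with prompt embeddings along the time axis.

    The ordering (speech first, then prompt) is an assumption for this sketch.
    """
    return project(speech_frames, weight, bias) + prompt_embeddings


def count_trainable(weight, bias):
    """Count projector parameters; the encoder and LLM contribute none because they stay frozen."""
    return sum(len(row) for row in weight) + len(bias)


if __name__ == "__main__":
    weight, bias = make_linear(ENCODER_DIM, LLM_DIM)
    speech = [[1.0] * ENCODER_DIM for _ in range(3)]   # stand-in for frozen encoder output
    prompt = [[0.0] * LLM_DIM for _ in range(2)]       # stand-in for frozen LLM prompt embeddings
    sequence = build_llm_input(speech, prompt, weight, bias)
    print(len(sequence), len(sequence[0]), count_trainable(weight, bias))
```

In a real system the concatenated sequence would be passed to the LLM as input embeddings, and only `weight` and `bias` would receive gradient updates during training.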

Comments:	Work in progress; will be open-sourced soon
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2402.08846 [cs.CL]
	(or arXiv:2402.08846v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.08846

Submission history

From: Ziyang Ma
[v1] Tue, 13 Feb 2024 23:25:04 UTC (999 KB)
