Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

Zhang, Cheng; Cheng, Jianyi; Shumailov, Ilia; Constantinides, George A.; Zhao, Yiren

doi:10.18653/v1/2023.emnlp-main.617


arXiv:2310.05079 (cs)

[Submitted on 8 Oct 2023 (v1), last revised 21 Oct 2023 (this version, v2)]


Authors:Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao


Abstract: The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and a $5\times$ higher memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
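The idea of sharing one scaling factor across a block of packed numbers can be illustrated with a minimal NumPy sketch. This is not the authors' released code or their exact number formats: it assumes a generic scheme in which each block of signed integers shares one power-of-two scale, and the names `block_quantise`, `block_dequantise`, `block_size`, and `bits` are hypothetical and chosen only for illustration.

```python
import numpy as np

def block_quantise(x, block_size=16, bits=6):
    """Quantise a 1-D array in blocks; each block shares one power-of-two scale.

    Illustrative assumption: signed integers of `bits` bits per element,
    plus one shared scale per block. Not the paper's exact formats.
    """
    x = np.asarray(x, dtype=np.float32)
    pad = (-x.size) % block_size              # zero-pad to a whole number of blocks
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    qmax = 2 ** (bits - 1) - 1                # e.g. 31 for 6-bit signed values
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)  # avoid log2(0) for all-zero blocks
    # Smallest power-of-two scale such that max_abs / scale fits within qmax.
    exponent = np.ceil(np.log2(safe / qmax))
    scale = np.exp2(exponent).astype(np.float32)
    q = np.clip(np.round(blocks / scale), -qmax, qmax).astype(np.int8)
    return q, scale, x.size

def block_dequantise(q, scale, size):
    """Rebuild a float32 approximation and drop the zero padding."""
    return (q.astype(np.float32) * scale).reshape(-1)[:size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=100).astype(np.float32)
    x[3] = 50.0                               # one outlier in the first block
    q, scale, n = block_quantise(x)
    x_hat = block_dequantise(q, scale, n)
    print("per-block scales:", scale.ravel())
    print("max abs error:", np.abs(x - x_hat).max())
```

In this sketch an outlier enlarges the scale of its own block only, so the remaining blocks keep a fine resolution. That is one way to picture how per-block scaling can limit the scaling offsets the abstract refers to, under the assumptions stated above.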

Comments:	Accepted by EMNLP2023
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2310.05079 [cs.LG]
	(or arXiv:2310.05079v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.05079
Related DOI:	https://doi.org/10.18653/v1/2023.emnlp-main.617

Submission history

From: Cheng Zhang
[v1] Sun, 8 Oct 2023 09:05:14 UTC (717 KB)
[v2] Sat, 21 Oct 2023 12:38:52 UTC (711 KB)
