FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
CoRR (2024)
Abstract
Six-bit quantization (FP6) can effectively reduce the size of large language
models (LLMs) and preserve the model quality consistently across varied
applications. However, existing systems do not provide Tensor Core support for
FP6 quantization and struggle to achieve practical performance improvements
during LLM inference. It is challenging to support FP6 quantization on GPUs due
to (1) unfriendly memory access of model weights with irregular bit-widths and
(2) high runtime overhead of weight de-quantization. To address these problems,
we propose TC-FPx, the first full-stack GPU kernel design scheme with unified
Tensor Core support of floating-point weights for various quantization bit-widths.
We integrate the TC-FPx kernel into an existing inference system, providing new
end-to-end support (called FP6-LLM) for quantized LLM inference, where better
trade-offs between inference cost and model quality are achieved. Experiments
show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU,
achieving 1.69x-2.65x higher normalized inference throughput than the FP16
baseline. The source code will be publicly available soon.
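To make the de-quantization step mentioned in the abstract concrete, the sketch below decodes a single 6-bit floating-point weight to FP16 on the GPU. It assumes an e3m2 layout (1 sign, 3 exponent, 2 mantissa bits, exponent bias 3), which is one common FP6 variant; the function name, the one-weight-per-byte storage in the example kernel, and the scalar decoding are illustrative assumptions only, not the paper's TC-FPx implementation, which de-quantizes packed weights with parallel bit-level manipulation.

// Minimal sketch (not the TC-FPx kernel): de-quantize one FP6 weight to FP16,
// assuming an e3m2 layout (1 sign, 3 exponent, 2 mantissa bits, exponent bias 3).
// Only the low 6 bits of `fp6` are used.
#include <cuda_fp16.h>

__device__ __forceinline__ half fp6_e3m2_to_fp16(unsigned int fp6)
{
    unsigned int sign     = (fp6 >> 5) & 0x1;   // 1 sign bit
    unsigned int exponent = (fp6 >> 2) & 0x7;   // 3 exponent bits, bias 3
    unsigned int mantissa =  fp6       & 0x3;   // 2 mantissa bits

    float value;
    if (exponent == 0) {
        // Subnormal: (mantissa / 4) * 2^(1 - 3)
        value = (mantissa / 4.0f) * 0.25f;
    } else {
        // Normal: (1 + mantissa / 4) * 2^(exponent - 3)
        value = (1.0f + mantissa / 4.0f) * exp2f((float)exponent - 3.0f);
    }
    return __float2half(sign ? -value : value);
}

// Toy kernel: one FP6 weight stored per byte for simplicity. In practice FP6
// weights are packed back-to-back at 6-bit granularity with no byte alignment,
// which is exactly the "unfriendly memory access" challenge the paper targets.
__global__ void dequant_fp6_to_fp16(const unsigned char* in, half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fp6_e3m2_to_fp16(in[i]);
}

In the actual system, this per-weight conversion must be fused into the matrix-multiplication pipeline and overlapped with Tensor Core computation; otherwise its runtime overhead (challenge (2) in the abstract) erases the bandwidth savings of the 6-bit format.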