TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
arXiv (Cornell University), 2024
Abstract
With large language models (LLMs) widely deployed in long content generation recently, there has emerged an increasing demand for efficient long-sequence inference support. However, the key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache will be loaded for every generated token, resulting in low utilization of computational cores and high latency. While various compression methods for the KV cache have been proposed to alleviate this issue, they suffer from degradation in generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable to long sequence generation. This approach leverages the original model weights and a dynamic sparse KV cache via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is further speculated by a smaller model to reduce its drafting latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K, achieving up to 2.31× on an A100 GPU, but also showcases scalability in handling even longer contexts. For the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108 s/token, only half as slow as the auto-regressive baseline on an A100, which attains 7.78× on our optimized offloading system. Additionally, TriForce achieves a 4.86× speedup over DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
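To make the hierarchy concrete, below is a minimal, runnable Python sketch of two-level speculative decoding under toy assumptions. All names here (`toy_model`, `toy_verify`, `speculative_step`, `middle_draft_block`, `generate`) are illustrative, not the repository's API, and real acceptance is determined by comparing model distributions rather than by the fixed probabilities simulated here. In TriForce itself, the middle layer is the target model's own weights paired with a retrieval-based sparse KV cache, and the top layer is the full model with the complete cache.

```python
import random

VOCAB = list(range(100))

def toy_model(prefix):
    # Stand-in for the small bottom-layer model: proposes one next token.
    return random.choice(VOCAB)

def toy_verify(p_accept):
    # Stand-in verifier: accepts each drafted token with probability
    # p_accept; on the first rejection it emits a corrected token and stops.
    def verify(prefix, proposed):
        accepted = []
        for tok in proposed:
            if random.random() < p_accept:
                accepted.append(tok)
            else:
                accepted.append(random.choice(VOCAB))
                break
        return accepted
    return verify

def speculative_step(draft_one, verify, prefix, gamma):
    """Draft gamma tokens autoregressively with the cheap model, then
    verify them in a single pass with the more expensive one."""
    proposed = []
    for _ in range(gamma):
        proposed.append(draft_one(prefix + proposed))
    return verify(prefix, proposed)

# Middle layer: target weights + retrieval-based sparse KV cache (simulated).
middle_verify = toy_verify(p_accept=0.8)
# Top layer: full model with the complete KV cache (simulated).
target_verify = toy_verify(p_accept=0.9)

def middle_draft_block(prefix, n, gamma_small=4):
    """Produce n draft tokens with the middle layer, which is itself
    speculated by the tiny model to hide its own drafting latency."""
    block = []
    while len(block) < n:
        block += speculative_step(toy_model, middle_verify,
                                  prefix + block, gamma_small)
    return block[:n]

def generate(prompt, n_tokens, gamma_large=8):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # The full model (and its large KV cache) is touched only once
        # per gamma_large drafted tokens, amortizing the cache loads.
        drafted = middle_draft_block(out, gamma_large)
        out += target_verify(out, drafted)
    return out[:len(prompt) + n_tokens]

print(generate([1, 2, 3], 32))
```

The point of the two levels is that each layer only needs to be fast relative to the layer above it: the tiny model hides the middle layer's drafting latency, and the sparse-cache middle layer hides the cost of loading the full KV cache at the top.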
Keywords
High-Performance Computing, Performance Optimization, Heterogeneous Computing, Hashing, Multicore Architectures