Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
ICML (2024)
Abstract
Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant-sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient.
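The idea of pairing an eviction-based cache with a constant-sized cache can be illustrated with a toy single-head example. This is only a sketch under stated assumptions, not the paper's implementation: the eviction policy is a plain sliding window, the evicted pairs are folded into a linear-attention-style running sum, and the feature map `feature_map` and the class `WindowPlusStateCache` are hypothetical names chosen for illustration (the abstract does not specify how the constant-sized cache is built).

```python
import math
from collections import deque


def feature_map(x):
    # Hypothetical positive feature map (ELU(x) + 1); an assumption for illustration.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class WindowPlusStateCache:
    """Sliding-window KV cache plus a constant-sized summary of evicted pairs."""

    def __init__(self, window, dim_k, dim_v):
        self.window = window
        self.recent = deque()  # exact (key, value) pairs still held in the cache
        self.state = [[0.0] * dim_v for _ in range(dim_k)]  # sum of phi(k) * v^T
        self.norm = [0.0] * dim_k  # sum of phi(k)

    def append(self, key, value):
        self.recent.append((key, value))
        if len(self.recent) > self.window:
            # Instead of discarding the oldest pair, fold it into the fixed-size state.
            old_key, old_value = self.recent.popleft()
            for i, p in enumerate(feature_map(old_key)):
                self.norm[i] += p
                for j, v in enumerate(old_value):
                    self.state[i][j] += p * v

    def attend(self, query):
        # Assumes at least one pair has been appended.
        scale = 1.0 / math.sqrt(len(query))
        phi_q = feature_map(query)
        dim_v = len(self.state[0])
        # Approximate contribution of all evicted pairs.
        numer = [
            sum(phi_q[i] * self.state[i][j] for i in range(len(phi_q)))
            for j in range(dim_v)
        ]
        denom = sum(p * n for p, n in zip(phi_q, self.norm))
        # Exact softmax-style contribution of the retained pairs.
        for key, value in self.recent:
            w = math.exp(scale * sum(q * k for q, k in zip(query, key)))
            denom += w
            for j in range(dim_v):
                numer[j] += w * value[j]
        return [n / denom for n in numer]
```

In this sketch the memory held per head is `window` exact pairs plus a `dim_k × dim_v` matrix and a `dim_k` vector, regardless of sequence length, while every past token still contributes to the output of `attend`. When the window is at least as long as the sequence, nothing is evicted and the result reduces to ordinary softmax attention.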
Keywords
Regenerating Codes, Hashing, Erasure Coding