Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
NeurIPS 2024 (2024)
Abstract
The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with a two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and Llama2-13B (1.67). Code: https://github.com/XuezheMax/megalodon
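For readers unfamiliar with the exponential moving average that Mega builds on, the following is a minimal sketch of the textbook real-valued recurrence y_t = alpha * x_t + (1 - alpha) * y_{t-1}. It is not the paper's CEMA, which the abstract names but does not define; the function name `ema`, the parameter `alpha`, and the zero initial state are assumptions made purely for illustration.

```python
def ema(xs, alpha):
    """Textbook exponential moving average (illustrative, not the paper's CEMA).

    Computes y_t = alpha * x_t + (1 - alpha) * y_{t-1}, assuming y_0 = 0.
    """
    y = 0.0          # the only state carried from one step to the next
    out = []
    for x in xs:
        y = alpha * x + (1.0 - alpha) * y
        out.append(y)
    return out


smoothed = ema([1.0, 1.0, 1.0, 1.0], alpha=0.5)
print(smoothed)
```

The recurrence carries a single running value between steps, so the per-step cost of this sketch does not grow with sequence length.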
Keywords
Mega, Efficient Architecture, Long Sequence Modeling, Unlimited Context Length