Training Large Language Models for Reasoning Through Reverse Curriculum Reinforcement Learning
ICML (2024)
Abstract
In this paper, we propose R^3: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R^3 overcomes these limitations by learning from correct demonstrations. Specifically, R^3 progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R^3 establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by 4.1 points on average. Notably, in program-based reasoning on GSM8K, it exceeds the baseline by 4.2 points across three backbone models, and without any extra data, Codellama-7B + R^3 performs comparably to larger models or closed-source models.
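The reverse-curriculum idea in the abstract can be illustrated with a short sketch: for each correct demonstration, the start state handed to the policy slides from the demonstration's end back to its beginning, and only an outcome reward on the final answer is used. This is a minimal, hypothetical Python sketch, not the authors' implementation; `policy_generate`, `outcome_reward`, and `rl_update` are assumed placeholders for a decoding call, a final-answer checker, and a policy-gradient update (e.g., a PPO step).

```python
# Illustrative sketch of reverse-curriculum RL with outcome-only rewards.
# All helper functions below are hypothetical placeholders, not a real API.

def reverse_curriculum_stages(question, demo_steps):
    """Yield start states whose demonstration prefix slides from end to start.

    The first stage gives the model all but the last reasoning step, so
    exploration is easy; the final stage gives only the question, which
    matches standard outcome-supervised RL.
    """
    for k in range(len(demo_steps) - 1, -1, -1):
        prefix = "\n".join(demo_steps[:k])      # correct steps provided to the model
        yield question + "\n" + prefix           # model must complete the remaining steps


def train_reverse_curriculum(policy, dataset, num_rollouts=4):
    """Outcome-supervised RL over the reverse curriculum (illustration only)."""
    for question, demo_steps, gold_answer in dataset:
        for start_state in reverse_curriculum_stages(question, demo_steps):
            for _ in range(num_rollouts):
                completion = policy_generate(policy, start_state)    # sample a rollout
                reward = outcome_reward(completion, gold_answer)     # 1 if final answer is correct, else 0
                rl_update(policy, start_state, completion, reward)   # e.g., one PPO update
```

Because each stage only asks the model to complete a short suffix of a correct demonstration, the sparse outcome reward effectively localizes errors to the step where generation begins, which is how outcome supervision comes to provide step-level signals.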
Keywords
Online Learning,Adaptive Learning Environments,Learning Analytics,Student Performance Prediction,Intelligent Tutoring Systems