Training Large Language Models for Reasoning Through Reverse Curriculum Reinforcement Learning
ICML (2024)
Abstract
In this paper, we propose R^3: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R^3 overcomes these limitations by learning from correct demonstrations. Specifically, R^3 progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R^3 establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by 4.1 points on average. Notably, in program-based reasoning on GSM8K, it exceeds the baseline by 4.2 points across three backbone models, and without any extra data, Codellama-7B + R^3 performs comparably to larger models or closed-source models.
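The reverse-curriculum idea in the abstract can be illustrated with a short sketch: for each correct demonstration, the start state handed to the policy slides from the demonstration's end back to its beginning, and only an outcome reward on the final answer is used. This is a minimal, hypothetical Python sketch, not the authors' implementation; `policy_generate`, `outcome_reward`, and `rl_update` are assumed placeholders for a decoding call, a final-answer checker, and a policy-gradient update (e.g., a PPO step).

```python
# Illustrative sketch of reverse-curriculum RL with outcome-only rewards.
# All helper functions below are hypothetical placeholders, not a real API.

def reverse_curriculum_stages(question, demo_steps):
    """Yield start states whose demonstration prefix slides from end to start.

    The first stage gives the model all but the last reasoning step, so
    exploration is easy; the final stage gives only the question, which
    matches standard outcome-supervised RL.
    """
    for k in range(len(demo_steps) - 1, -1, -1):
        prefix = "\n".join(demo_steps[:k])      # correct steps provided to the model
        yield question + "\n" + prefix           # model must complete the remaining steps


def train_reverse_curriculum(policy, dataset, num_rollouts=4):
    """Outcome-supervised RL over the reverse curriculum (illustration only)."""
    for question, demo_steps, gold_answer in dataset:
        for start_state in reverse_curriculum_stages(question, demo_steps):
            for _ in range(num_rollouts):
                completion = policy_generate(policy, start_state)    # sample a rollout
                reward = outcome_reward(completion, gold_answer)     # 1 if final answer is correct, else 0
                rl_update(policy, start_state, completion, reward)   # e.g., one PPO update
```

Because each stage only asks the model to complete a short suffix of a correct demonstration, the sparse outcome reward effectively localizes errors to the step where generation begins, which is how outcome supervision comes to provide step-level signals.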
Keywords
Online Learning,Adaptive Learning Environments,Learning Analytics,Student Performance Prediction,Intelligent Tutoring Systems