Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

arXiv (Cornell University), 2024

Abstract
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward-maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances on RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward-maximization framework. Recently, a new wave of research sidesteps the reward-maximization presumptions in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.
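The abstract describes DNO as a batched on-policy algorithm whose per-iteration update is a regression-based, contrastive objective over preference pairs labeled by an oracle. The snippet below is a minimal sketch of that kind of objective, assuming a DPO-style log-ratio loss stands in for the paper's regression target; the tensor names, the `beta` coefficient, and the toy data are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of a contrastive preference objective over oracle-labeled
# pairs (assumption: a DPO-style log-ratio loss as the regression target).
import torch
import torch.nn.functional as F

def contrastive_preference_loss(policy_logp_w, policy_logp_l,
                                ref_logp_w, ref_logp_l, beta=0.1):
    """Push the policy's log-ratio margin between the oracle-preferred
    response (y_w) and the dispreferred one (y_l) to be large and positive,
    relative to a fixed reference policy."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage: sequence-level log-probabilities for a batch of 4 preference
# pairs, as if sampled on-policy and labeled by a preference oracle.
policy_logp_w = torch.randn(4, requires_grad=True)
policy_logp_l = torch.randn(4, requires_grad=True)
ref_logp_w, ref_logp_l = torch.randn(4), torch.randn(4)

loss = contrastive_preference_loss(policy_logp_w, policy_logp_l,
                                   ref_logp_w, ref_logp_l)
loss.backward()  # gradients flow only into the current policy's log-probs
```

In an iterative scheme of the kind the abstract outlines, each round would sample fresh responses from the current policy, have the preference oracle rank them, regress this objective on the resulting (winner, loser) pairs, and then use the updated policy as the starting point for the next iteration.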