Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
CoRR (2024)
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a central
tool for language model alignment. We consider online exploration in RLHF,
which exploits interactive access to human or AI feedback by deliberately
encouraging the model to produce diverse, maximally informative responses. By
allowing RLHF to confidently stray from the pre-trained model, online
exploration offers the possibility of novel, potentially super-human
capabilities, but its full potential as a paradigm for language model training
has yet to be realized, owing to computational and statistical bottlenecks in
directly adapting existing reinforcement learning techniques. We propose a new
algorithm for online exploration in RLHF, Exploratory Preference Optimization
(XPO), which is simple and practical – a one-line change to (online) Direct
Preference Optimization (DPO; Rafailov et al., 2023) – yet enjoys the
strongest known provable guarantees and promising empirical performance. XPO
augments the DPO objective with a novel and principled exploration bonus,
empowering the algorithm to explore outside the support of the initial model
and human feedback data. In theory, we show that XPO is provably
sample-efficient and converges to a near-optimal language model policy under
natural exploration conditions, irrespective of whether the initial model has
good coverage. Our analysis, which builds on the observation that DPO
implicitly performs a form of Q^⋆-approximation (or, Bellman error
minimization), combines previously disparate techniques from language modeling
and theoretical reinforcement learning in a serendipitous fashion through the
perspective of KL-regularized Markov decision processes. Empirically, we find
that XPO is more sample-efficient than non-exploratory DPO variants in a
preliminary evaluation.
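To make the "one-line change to (online) DPO" described above concrete, the following is a minimal, hypothetical PyTorch-style sketch: the standard DPO loss on preference pairs plus an α-weighted log-probability term on freshly sampled responses, standing in for the exploration bonus. The helper name `sequence_logprob`, the batch keys, the HuggingFace-style `.logits` interface, the sign convention, and the choice of sampling distribution for the extra responses are all illustrative assumptions, not the paper's exact objective.

```python
# Hypothetical sketch of one XPO-style training step (not the authors' code).
# Assumed form: DPO loss on labeled preference pairs, plus an alpha-weighted
# log-probability term on freshly sampled responses acting as the exploration
# bonus. Sign convention and sampling distribution are assumptions.
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Sum of per-token log-probs of response_ids given prompt_ids (assumed helper,
    HuggingFace-style model exposing .logits)."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    # Logits at positions that predict the response tokens.
    logits = model(input_ids).logits[:, prompt_ids.size(-1) - 1 : -1, :]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

def xpo_style_loss(policy, ref_policy, batch, beta=0.1, alpha=1e-3):
    # Standard DPO term: chosen response y_w preferred over rejected response y_l.
    logp_w = sequence_logprob(policy, batch["prompt"], batch["chosen"])
    logp_l = sequence_logprob(policy, batch["prompt"], batch["rejected"])
    with torch.no_grad():
        ref_logp_w = sequence_logprob(ref_policy, batch["prompt"], batch["chosen"])
        ref_logp_l = sequence_logprob(ref_policy, batch["prompt"], batch["rejected"])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo_loss = -F.logsigmoid(margin).mean()

    # "One-line change": alpha-weighted log-probability of freshly sampled
    # responses (batch["sampled"] assumed to come from the current online policy),
    # added to the objective as the exploration term.
    logp_sampled = sequence_logprob(policy, batch["prompt"], batch["sampled"])
    return dpo_loss + alpha * logp_sampled.mean()
```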