Offline Distillation for Robot Lifelong Learning with Imbalanced Experience

ArXiv (2022)


Abstract

Robots will experience non-stationary environment dynamics throughout their lifetime: the robot dynamics can change due to wear and tear, or its surroundings may change over time. Eventually, the robot should perform well in all of the environment variations it has encountered. At the same time, it should still be able to learn fast in a new environment. We investigate two challenges in such a lifelong learning setting. First, existing off-policy algorithms struggle with the trade-off between being conservative to maintain good performance in the old environment and learning efficiently in the new environment. We propose the Offline Distillation Pipeline to break this trade-off by separating the training procedure into interleaved phases of online interaction and offline distillation. Second, training with the combined datasets from multiple environments across the lifetime might create a significant performance drop compared to training on the datasets individually. Our hypothesis is that both the imbalanced quality and size of the datasets exacerbate the extrapolation error of the Q-function during offline training over the “weaker” dataset. We propose a simple fix to the issue by keeping the policy closer to the dataset during the distillation phase. In the experiments, we demonstrate these challenges and the proposed solutions with a simulated bipedal robot walking task across various environment changes. We show that the Offline Distillation Pipeline achieves better performance across all the encountered environments without affecting data collection. We also provide a comprehensive empirical study to support our hypothesis on the data imbalance issue.

We study the lifelong learning problem in a simulated bipedal walking task, where the goal is to maximize the forward velocity while avoiding falling.
Our experiments involve a small humanoid robot, called OP3, that has 20 actuated joints and has been previously used to train walking directly on hardware (Bloesch et al., 2022). All of the experiments in this work are conducted in simulation, both because of limited access to hardware and to allow a more controlled experimental setting. However, we try to incorporate realistic considerations into the experiments, with the hope that the approach can be deployed on real robots in the future. All of the results are averaged over 3 random seeds.
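
The interleaving of online interaction and offline distillation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `LifelongBuffer`, `run_pipeline`, `collect`, `online_update`, `distill`, and `regularized_policy_loss` are all hypothetical. Training the online learner only on the newest environment's data is an assumption. The quadratic penalty is a generic behavior-cloning-style regularizer standing in for "keeping the policy closer to the dataset"; the abstract does not specify the actual form.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence, Tuple

# Toy placeholder for a transition: (state, action, reward).
Transition = Tuple[float, float, float]


@dataclass
class LifelongBuffer:
    """Keeps one dataset per environment encountered so far."""
    datasets: List[List[Transition]] = field(default_factory=list)

    def start_environment(self) -> None:
        self.datasets.append([])

    def add(self, transition: Transition) -> None:
        self.datasets[-1].append(transition)

    def combined(self) -> List[Transition]:
        # Datasets may differ a lot in size and quality (the imbalance issue).
        return [t for dataset in self.datasets for t in dataset]


def run_pipeline(
    environments: Sequence[Any],
    collect: Callable[[Any], Transition],
    online_update: Callable[[List[Transition]], None],
    distill: Callable[[List[Transition]], None],
    steps_per_env: int,
) -> LifelongBuffer:
    """Alternate an online phase and an offline distillation phase per environment."""
    buffer = LifelongBuffer()
    for env in environments:
        buffer.start_environment()
        # Phase 1: online interaction, learning quickly in the new environment.
        # Assumption: the online learner only sees the newest dataset.
        for _ in range(steps_per_env):
            buffer.add(collect(env))
            online_update(buffer.datasets[-1])
        # Phase 2: offline distillation over all data collected so far,
        # which does not influence how data was collected above.
        distill(buffer.combined())
    return buffer


def regularized_policy_loss(
    q_value: float, policy_action: float, dataset_action: float, bc_weight: float
) -> float:
    """Generic penalty (an assumption): maximize Q while staying near dataset actions."""
    return -q_value + bc_weight * (policy_action - dataset_action) ** 2
```

In this sketch, a larger `bc_weight` keeps the distilled policy closer to the actions in the data, which is one plausible way to limit Q-function extrapolation error when the combined datasets are imbalanced.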
