PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

ICML (2024)

Abstract
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data?

In this paper, we propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as iterative visual question answering. In each iteration, the image is annotated with a visual representation of proposals that the VLM can refer to (e.g., candidate robot actions, localizations, or trajectories). The VLM then selects the best ones for the task. These proposals are iteratively refined, allowing the VLM to eventually zero in on the best available answer. We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. Although current performance is far from perfect, our work highlights potentials and limitations of this new regime and shows a promising approach for Internet-scale VLMs in robotic and spatial reasoning domains.

Website: pivot-prompt.github.io and HuggingFace: https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo
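The iterate-annotate-select loop described above resembles a sample-based optimization scheme. The following is a minimal one-dimensional sketch of that idea, not the paper's implementation: the `select_best` callback is a hypothetical stand-in for the VLM call (in PIVOT, candidates are drawn onto the image as labeled annotations and the VLM is asked which ones best accomplish the task), and all distribution parameters here are illustrative assumptions.

```python
import random
import statistics

def pivot_sketch(select_best, mean=0.0, std=1.0,
                 n_candidates=8, n_elite=3, iters=3, seed=0):
    """Toy sketch of Prompting with Iterative Visual Optimization.

    `select_best(candidates)` stands in for the VLM query and must return
    candidate indices ordered from most to least promising.
    """
    rng = random.Random(seed)
    for _ in range(iters):
        # 1. Sample candidate actions from the current proposal distribution.
        candidates = [rng.gauss(mean, std) for _ in range(n_candidates)]
        # 2. Ask the "VLM" which annotated proposals look best for the task.
        elite = [candidates[i] for i in select_best(candidates)[:n_elite]]
        # 3. Refit the proposal distribution to the selected candidates and
        #    shrink it, so later iterations zero in on the best answer.
        mean = statistics.fmean(elite)
        std = max(std * 0.5, 1e-3)
    return mean
```

A usage example with a mock selector that prefers candidates near a target value of 2.0 shows the loop concentrating around that target over a few iterations; in the real system the selection signal comes from the VLM reading the annotated image rather than from a known objective.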
Keywords
Simulations