CoNVOI: Context-aware Navigation Using Vision Language Models in Outdoor and Indoor Environments

2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2024)

We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g., "stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach that annotates the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating the image with a local occupancy map of the environment. The marked numbers ground image locations in the real world, direct the VLM's attention solely to navigable locations, and elucidate to the VLM the spatial relationships between those locations and the terrains depicted in the image. Next, we query the VLM to select numbers on the marked image that satisfy the context-based behavior text prompt, and construct a reference path using the selected numbers. Finally, we propose a method to extrapolate the reference trajectory when the robot's environmental context has not changed, to prevent unnecessary VLM queries. We use the reference trajectory to guide a motion planner, and demonstrate that it leads to human-like behaviors (e.g., not cutting through a group of people, using crosswalks, etc.) in various real-world indoor and outdoor scenarios.
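The marking and path-construction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the grid convention (0 = free, 1 = occupied), the sampling stride, the cell resolution, and the function names `mark_free_cells`, `cell_to_metric`, and `reference_path` are all assumptions made for illustration. The projection of marks onto the RGB image and the VLM query itself are omitted, and the VLM's answer is represented by a plain list of numbers.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) index into the occupancy grid


def mark_free_cells(grid: List[List[int]], stride: int = 2) -> Dict[int, Cell]:
    """Assign labels 1..K to obstacle-free cells sampled every `stride` cells."""
    marks: Dict[int, Cell] = {}
    label = 1
    for r in range(0, len(grid), stride):
        for c in range(0, len(grid[r]), stride):
            if grid[r][c] == 0:  # assumed convention: 0 = free, 1 = occupied
                marks[label] = (r, c)
                label += 1
    return marks


def cell_to_metric(
    cell: Cell, resolution: float, origin: Tuple[float, float]
) -> Tuple[float, float]:
    """Return the metric (x, y) of a cell centre in the local map frame."""
    r, c = cell
    return (origin[0] + (c + 0.5) * resolution, origin[1] + (r + 0.5) * resolution)


def reference_path(
    marks: Dict[int, Cell],
    selected: List[int],
    resolution: float = 0.5,
    origin: Tuple[float, float] = (0.0, 0.0),
) -> List[Tuple[float, float]]:
    """Map the numbers chosen by the VLM to waypoints, ignoring unknown labels."""
    return [cell_to_metric(marks[n], resolution, origin) for n in selected if n in marks]


# Hypothetical 4x4 local occupancy map with an obstacle column and one blocked corner.
grid = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
marks = mark_free_cells(grid, stride=2)
# Stand-in for the VLM's reply; label 9 does not exist and is dropped.
path = reference_path(marks, selected=[1, 2, 3, 9])
```

Because only free cells receive a number, any answer the VLM gives can refer only to navigable locations, which is the grounding property the abstract attributes to the marks. Discarding labels that were never drawn is one simple way to guard against a malformed reply.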
Visual Question Answering, Object Recognition, Geovisualization, Image Captioning, Language Understanding

