Towards Open-World Grasping with Large Vision-Language Models

Conference on Robot Learning 2024
Interactive Robot Learning Lab
Department of Artificial Intelligence, University of Groningen

Abstract

The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic contexts, but rely on external vision and action models to ground such knowledge in the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM’s reasoning capacity is constrained by the quality of visual grounding, and b) LLMs lack the low-level spatial understanding of the world that is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling these limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning, and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation on cluttered indoor scene datasets to showcase OWG’s robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods.

Open-World Grasper (OWG)


Given a user instruction and an observation, OWG first invokes a segmentation model to recover pixel-level masks and overlays them with numeric IDs as visual markers in a new image. The VLM then runs three stages: (i) grounding the target object from the language expression in the marked image, (ii) planning whether to grasp the target directly or first remove a surrounding object, and (iii) invoking a grasp synthesis model to generate grasp proposals and ranking them according to the object's shape and neighbouring objects. The best grasp pose (highlighted here in pink; not part of the prompt) is executed and the observation is updated for a new run, until the target object is grasped.
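For readers who prefer a procedural view, the caption above can be condensed into the minimal sketch below. The component interfaces (camera, segmenter, marker, vlm, grasp_synth, robot) and their method names are hypothetical placeholders for the segmentation, VLM, and grasp synthesis modules; this is an illustration of the loop, not the released implementation.

```python
# Minimal sketch of the OWG loop (hypothetical interfaces, not a released API).

def open_world_grasp(instruction, camera, segmenter, marker, vlm, grasp_synth,
                     robot, max_steps=5):
    """Iteratively grasp (or remove) objects until the referred target is grasped."""
    for _ in range(max_steps):
        rgb, depth = camera.observe()

        # 1) Recover pixel-level instance masks and overlay numeric-ID markers.
        masks = segmenter(rgb)
        marked_img = marker(rgb, masks)

        # 2) Ground the referred target among the marked objects.
        target_id = vlm.ground(marked_img, instruction)

        # 3) Decide whether to grasp the target or first remove a surrounding object.
        grasp_id = vlm.plan(marked_img, target_id)

        # 4) Generate grasp candidates and rank them by shape and neighbouring contact.
        candidates = grasp_synth(rgb, depth, masks[grasp_id])
        best_grasp = vlm.rank(marked_img, grasp_id, candidates)

        # 5) Execute the top-ranked grasp and re-observe the updated scene.
        robot.execute(best_grasp)
        if grasp_id == target_id:
            return True
    return False
```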

Zero-Shot Mark-based Visual Prompting


Example GPT-4v responses (from left to right): a) Open-ended referring segmentation, i.e., grounding, b) Grounded grasp planning, and c) Grasp ranking via contact reasoning. We omit parts of the prompt and response for brevity. We draw visual markers, such as instance segmentations and grasp proposals, together with unique label IDs, which enables GPT-4v to reason about the objects in natural language by referring to their IDs.
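As a concrete illustration of such mark-based prompting, the snippet below overlays numeric IDs on segmentation masks with OpenCV and sends the marked image, together with a grounding question, to a GPT-4-class vision model through the OpenAI chat API. The model name, prompt wording, and helper structure are assumptions made for this sketch and do not reproduce the exact prompts used by OWG.

```python
# Illustrative mark-based visual prompting (prompt wording and model choice are
# assumptions, not the exact OWG prompts).
import base64

import cv2
import numpy as np
from openai import OpenAI


def overlay_numeric_marks(rgb, masks):
    """Draw each mask's contour and a numeric ID at its centroid."""
    marked = rgb.copy()
    for obj_id, mask in enumerate(masks):
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        cv2.drawContours(marked, contours, -1, (0, 255, 0), 2)
        ys, xs = np.nonzero(mask)
        cv2.putText(marked, str(obj_id), (int(xs.mean()), int(ys.mean())),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 255, 255), 2)
    return marked


def ground_with_vlm(marked_img, instruction):
    """Ask a GPT-4 vision model which marked ID matches the instruction."""
    _, buf = cv2.imencode(".png", marked_img)
    image_b64 = base64.b64encode(buf.tobytes()).decode("utf-8")
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4-class vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Each object in the image is marked with a numeric ID. "
                         f"Which ID corresponds to: '{instruction}'? "
                         "Answer with the ID only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```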

Robot Experiments

Robustness to Open-Ended Language


In order to evaluate the open-ended grounding capabilities of OWG, we curate referring expressions of different types for the OCID dataset: a) Name, which contains an open-vocabulary description of an object (e.g. Choco Krispies), b) Attribute, which describes a specific object attribute such as color, texture, shape, material or state (e.g. opened cereal box), c) Relation (spatial / visual / semantic), which contains multi-object references that describe relations between objects (e.g. cereal behind bowl), d) Affordance, which contains a verb indicating the user's intent and requires contextual understanding to resolve (e.g. I want to wipe my hands), and e) Multi-hop, which requires multiple reasoning steps to reach a final answer (e.g. second largest cereal box from the left of the blue bowl). We find that GPT-4v, when prompted with explicit marker-based visual prompts, outperforms previous CLIP-based methods as well as other visually-grounded open-source VLMs.
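For concreteness, a per-type grounding evaluation of this kind could be organized as in the sketch below, where each annotated query stores its expression type and ground-truth object ID, and accuracy is reported per type. The field names and data layout are illustrative assumptions, not the released benchmark format.

```python
# Illustrative per-type grounding evaluation (data layout is an assumption).
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReferringQuery:
    expression: str   # e.g. "second largest cereal box from the left of the blue bowl"
    expr_type: str    # "name" | "attribute" | "relation" | "affordance" | "multi-hop"
    target_id: int    # ground-truth object ID in the marked image


def per_type_accuracy(queries, predict_id):
    """predict_id(query) -> object ID predicted by the grounding model."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in queries:
        total[q.expr_type] += 1
        correct[q.expr_type] += int(predict_id(q) == q.target_id)
    return {t: correct[t] / total[t] for t in total}
```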

BibTeX

@inproceedings{tziafas2024openworldgraspinglargevisionlanguage,
      title={Towards Open-World Grasping with Large Vision-Language Models},
      author={Georgios Tziafas and Hamidreza Kasaei},
      year={2024},
      booktitle={8th Conference on Robot Learning (CoRL 2024)},
}