Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

University of Groningen
arXiv preprint

Abstract

In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot agent using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, as well as arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. image-text matching), thereby marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample efficiency, and robustness to the user's vocabulary, while transferring to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object picking task, both in simulation and with a real robot. We make our code and datasets available at https://gtziafas.github.io/neurosymbolic-manipulation.
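To make the primitive-program idea concrete, the sketch below shows how a parsed instruction could be expressed as a composition of such primitives. This is a minimal, self-contained illustration only: the toy scene, the primitive names (filter_concept, relate, count, unique), and the symbolic stand-ins used in place of the neural modules are assumptions for exposition, not the released implementation.

# Minimal illustrative sketch (not the released code): an instruction is parsed
# into a program over a library of primitives; some primitives are purely
# symbolic (e.g. counting), others stand in for trainable neural modules
# (e.g. image-text matching). All names and the toy scene are hypothetical.

# Toy scene: detected objects with a category label, colour and 2D position.
scene = [
    {"id": 0, "name": "mug",  "color": "red",  "pos": (0.2, 0.5)},
    {"id": 1, "name": "bowl", "color": "blue", "pos": (0.6, 0.5)},
    {"id": 2, "name": "mug",  "color": "blue", "pos": (0.8, 0.4)},
]

def filter_concept(objs, concept):
    # proxy for neural image-text matching against per-object visual features
    return [o for o in objs if concept in (o["name"], o["color"])]

def relate(objs, relation, anchors):
    # proxy for spatial relation comprehension
    if relation == "left":
        return [o for o in objs if any(o["pos"][0] < a["pos"][0] for a in anchors)]
    raise ValueError(f"unknown relation: {relation}")

def count(objs):
    return len(objs)  # purely symbolic primitive

def unique(objs):
    assert len(objs) == 1, "ambiguous referring expression"
    return objs[0]

# A parser could map "grasp the red mug left of the bowl" to this program:
target = unique(
    relate(filter_concept(filter_concept(scene, "mug"), "red"),
           "left", filter_concept(scene, "bowl"))
)
print("object to grasp:", target["id"])  # -> 0

In the full system the final step would hand the selected object to the arm-control primitive rather than print its id.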

Method


A schematic of the proposed framework. First, objects are segmented and localized in 3D space (top left) and the scene is represented as a graph with extracted object-based features (visual, grasp pose) as nodes and spatial relation features as edges (top middle). A human user provides an instruction and a language parser synthesizes an executable program (bottom left), built from a library of primitives (bottom middle). A program executor utilizes a set of concept grounding modules to ground words to different objects (center) and executes the predicted program step by step (top right), in order to identify the queried object and instruct the robot to grasp it (bottom right).
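The snippet below sketches the scene-graph representation and executor loop described in the figure, under assumed field and function names (ObjectNode, SceneGraph, execute_program are illustrative and do not come from the released code): object nodes carry visual features and a grasp pose, edges carry spatial-relation features, and each program step is dispatched to a concept-grounding module or symbolic operation.

# Hypothetical sketch of the scene-graph data structure and executor loop.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class ObjectNode:
    obj_id: int
    visual_feat: List[float]        # per-object visual features from the vision pipeline
    grasp_pose: Tuple[float, ...]   # e.g. a 6-DoF pose (x, y, z, qx, qy, qz, qw)

@dataclass
class SceneGraph:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    # spatial-relation features per ordered object pair, e.g. relative offsets
    edges: Dict[Tuple[int, int], List[float]] = field(default_factory=dict)

def execute_program(program: List[Tuple[str, dict]],
                    graph: SceneGraph,
                    modules: Dict[str, Callable]):
    """Run the parsed program step by step: each step either grounds a word to
    a subset of nodes via a concept-grounding module or applies a symbolic op."""
    state = list(graph.nodes.values())
    for name, args in program:
        state = modules[name](state, graph, **args)
    return state  # e.g. the selected node, whose grasp_pose is sent to the arm

Keeping grasp poses on the nodes means that, once the program has narrowed the graph down to a single object, the executor can forward that object's pose directly to the arm controller.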

Datasets

Robot Demonstrations

Handling Failures via Interactivity

Video Presentation

BibTeX

@misc{tziafas2023hybrid,
      title={A Hybrid Compositional Reasoning Approach for Interactive Robot Manipulation}, 
      author={Georgios Tziafas and Hamidreza Kasaei},
      year={2023},
      eprint={2210.00858},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}