3D Feature Distillation with Object-Centric Priors

1Interactive Robot Learning Lab, University of Groningen
2The University of Edinburgh
3University College London

Abstract

Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Subsequent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.

MV-TOD: Multi-view Tabletop Object Dataset

MV-TOD dataset image
  • 15k scenes, >3300 object CAD models
  • 73 views per scene
  • 2D/3D segmentation masks
  • 6D object poses
  • 6-DoF grasp poses
  • Object-level text annotations
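
For illustration, a minimal Python sketch of what a single MV-TOD scene record could contain is shown below. The field names, array shapes, and layout are assumptions chosen for readability, not the dataset's actual schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class MVTODScene:
    """Hypothetical container for one MV-TOD scene (all field names are illustrative)."""
    rgb: np.ndarray           # (73, H, W, 3) RGB renders, one per camera view
    depth: np.ndarray         # (73, H, W) depth maps aligned with the RGB views
    masks_2d: np.ndarray      # (73, H, W) per-view 2D instance segmentation masks
    points: np.ndarray        # (N, 3) scene point cloud
    masks_3d: np.ndarray      # (N,) per-point instance labels (3D segmentation)
    object_poses: np.ndarray  # (K, 4, 4) 6D pose of each of the K objects
    grasps: np.ndarray        # (G, 4, 4) 6-DoF grasp poses as homogeneous transforms
    texts: list[str]          # object-level text annotations, one entry per object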

DROP-CLIP: Distilled Representations with Object-Centric Priors from CLIP

DROP-CLIP method overview image

Object-centric priors:

  • 2D object-level CLIP features from masked crops (see the sketch after this list)
  • Semantic Informativeness Metric to weight multi-view fusion
  • Object boundaries with 3D segmentation masks
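
As a rough sketch of the first prior, the snippet below extracts an object-level CLIP feature for a single view by masking the RGB image with the object's instance mask, cropping the object's bounding box, and encoding the crop. It assumes an open_clip ViT-B-32 backbone purely for illustration; the paper's exact CLIP variant and preprocessing may differ.

import numpy as np
import torch
import open_clip
from PIL import Image

# Assumed backbone for illustration; the paper's CLIP variant may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def object_clip_feature(rgb: np.ndarray, mask: np.ndarray) -> torch.Tensor:
    """CLIP feature of one object in one view, from a masked crop.

    rgb:  (H, W, 3) uint8 image of the view.
    mask: (H, W) boolean instance mask of the object in that view.
    """
    # Black out everything except the object, then crop its bounding box.
    masked = np.where(mask[..., None], rgb, 0).astype(np.uint8)
    ys, xs = np.nonzero(mask)
    crop = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    image = preprocess(Image.fromarray(crop)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)  # unit-normalized feature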

Semantic Informativeness Metric

The contribution of each camera view to the fused object feature is weighted according to CLIP's ability to ground the object's text annotation from that view; views with poor grounding are treated as uninformative and eliminated from the fusion.
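
Below is a minimal sketch of one plausible instantiation of this metric, reusing the CLIP model from the previous snippet: each view's masked-crop feature is scored by its cosine similarity to the CLIP text embedding of the object's annotation, views below a cutoff are discarded as uninformative, and the remaining features are similarity-weighted into a single object-level feature. The cutoff value and the exact weighting scheme are illustrative assumptions, not the paper's precise formulation.

import torch
import open_clip

tokenizer = open_clip.get_tokenizer("ViT-B-32")

def fuse_object_features(view_feats: torch.Tensor, annotation: str,
                         model, keep_thresh: float = 0.2) -> torch.Tensor:
    """Fuse per-view object features using a CLIP-grounding score per view.

    view_feats: (V, D) unit-normalized CLIP features of the object's masked crops.
    annotation: object-level text annotation, e.g. "the red mug".
    keep_thresh: illustrative cutoff below which a view is treated as uninformative.
    """
    with torch.no_grad():
        text = model.encode_text(tokenizer([annotation]))
    text = text / text.norm(dim=-1, keepdim=True)

    # Score each view by how well CLIP grounds the annotation from that view.
    scores = (view_feats @ text.T).squeeze(-1)   # (V,) cosine similarities
    keep = scores > keep_thresh                  # drop uninformative views
    if not keep.any():                           # fall back to the best view
        keep = scores == scores.max()

    weights = scores[keep] / scores[keep].sum()  # similarity-weighted average
    fused = (weights[:, None] * view_feats[keep]).sum(0)
    return fused / fused.norm()                  # unit-normalized object feature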

Qualitative Results

Real-Time 3D Visual Grounding

Interactive Grounding (Simulation)

Robustness to Input View (Simulation)

Robustness to Input View (YCB-M)

Robustness to Input View (Real Robot)

Robot Experiments

Simulation

Real Robot

Comparisons with SfM Methods

Comparison image 1
Comparison image 2

Compared to online feature distillation methods (such as those based on NeRFs or Gaussian Splatting), our approach:

  • Is applicable zero-shot, without per-scene online training.
  • Supports real-time inference.
  • Operates from a single view and does not require capturing multiple images of the scene.
  • Does not require external segmentation models.

BibTeX

@misc{tziafas20243dfeaturedistillationobjectcentric,
  title={3D Feature Distillation with Object-Centric Priors},
  author={Georgios Tziafas and Yucheng Xu and Zhibin Li and Hamidreza Kasaei},
  year={2024},
  eprint={2406.18742},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.18742},
}