Grounding natural language to the physical world is a ubiquitous problem with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have become widely popular due to their impressive open-vocabulary grounding capabilities in 2D images. Subsequent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room-scan data that require access to multiple camera views, which is impractical in robot manipulation scenarios. Additionally, related methods typically fuse features at the pixel level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information and fuses features at the object level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, comprising 15k scenes built from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be repurposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.
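The object-level fusion described above can be summarized in a short sketch. This is a minimal illustration under assumed inputs (precomputed, L2-normalized per-view CLIP embeddings of masked object crops, per-view informativeness weights, and the indices of each object's 3D points), not the authors' released implementation; the weighted average and the broadcast of one fused feature to all of an object's points capture the core idea.

```python
import torch

def fuse_object_features(view_crop_feats: torch.Tensor,
                         view_weights: torch.Tensor) -> torch.Tensor:
    """Fuse per-view CLIP features of a single object instance.

    view_crop_feats : (V, D) unit-norm CLIP image embeddings of the object's
                      masked crop in each of V camera views.
    view_weights    : (V,) informativeness weights (see the weighting sketch
                      further below), assumed to sum to 1.
    """
    fused = (view_weights[:, None] * view_crop_feats).sum(dim=0)
    return fused / fused.norm()  # keep the fused feature on the CLIP hypersphere


def lift_to_points(objects, num_points: int, feat_dim: int) -> torch.Tensor:
    """Broadcast each object's fused feature to all 3D points inside its
    instance mask, producing dense point-level CLIP targets for distillation."""
    point_feats = torch.zeros(num_points, feat_dim)
    for obj in objects:
        # obj: {"crop_feats": (V, D), "weights": (V,), "point_ids": LongTensor of point indices}
        point_feats[obj["point_ids"]] = fuse_object_features(obj["crop_feats"], obj["weights"])
    return point_feats
```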
Object-centric priors:
(i) Weight the contribution of each camera view according to CLIP's capability to ground a text annotation of the object, so that uninformative views contribute less (see the sketch below).
(ii) Fuse the weighted features at the object level via 2D instance segmentation masks, assigning the fused feature to the object's 3D points rather than fusing per pixel.
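A minimal sketch of one way to compute such view weights, assuming per-view CLIP image embeddings of the object crop and a CLIP text embedding of its annotation are precomputed and L2-normalized; the softmax-over-views normalization and the temperature value are illustrative choices, not necessarily the exact scheme used in the paper.

```python
import torch

def semantic_view_weights(view_image_feats: torch.Tensor,
                          text_feat: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Score each camera view by how well CLIP grounds the object's text annotation.

    view_image_feats : (V, D) unit-norm CLIP embeddings of the object's crop in each view.
    text_feat        : (D,) unit-norm CLIP embedding of the object's text annotation.
    Returns (V,) weights summing to 1; views in which the crop matches the text
    poorly (e.g. due to occlusion) receive low weight and contribute little.
    """
    sims = view_image_feats @ text_feat              # per-view cosine similarity
    return torch.softmax(sims / temperature, dim=0)  # normalize into view weights
```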
Demos: Interactive Grounding (Simulation); Robustness to Input View (Simulation, YCB-M, Real Robot); Simulation and Real Robot demonstrations.
Compared to online feature distillation methods (such as NeRF- and Gaussian-Splatting-based approaches), our approach requires no per-scene optimization, generalizes to novel scenes, and needs only a single RGB-D view at test time rather than multiple camera views.
@misc{tziafas20243dfeaturedistillationobjectcentric,
  title         = {3D Feature Distillation with Object-Centric Priors},
  author        = {Georgios Tziafas and Yucheng Xu and Zhibin Li and Hamidreza Kasaei},
  year          = {2024},
  eprint        = {2406.18742},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2406.18742},
}