SceneFun3D

Fine-Grained Functionality and Affordance Understanding in 3D Scenes

CVPR 2024 Oral

Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, Francis Engelmann
ETH Zürich · Google · TUM · Microsoft



We introduce SceneFun3D, a large-scale dataset with highly accurate interaction annotations in 3D real-world indoor environments.

Abstract

Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However, identifying objects and their parts is only an intermediate step towards a more fine-grained goal: effectively interacting with the functional interactive elements (e.g., handles, knobs, buttons) in the scene to accomplish diverse tasks. To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information describing how to interact with these elements, and with a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset, we introduce three novel tasks, namely functionality segmentation, task-driven affordance grounding and 3D motion estimation, and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding.
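To make the structure of the annotations concrete, below is a minimal Python sketch of what a single interaction record could contain: a functional element mask, its motion parameters, and the natural language task descriptions that refer to it. All class and field names here (FunctionalityAnnotation, MotionParams, motion_type, ...) are illustrative assumptions for this sketch, not the actual schema of the SceneFun3D data toolkit.

from dataclasses import dataclass, field

@dataclass
class MotionParams:
    # How the element moves when actuated (field names are assumed, not official).
    motion_type: str                     # e.g. "rotational" or "translational"
    axis: tuple[float, float, float]     # motion axis direction in scene coordinates
    origin: tuple[float, float, float]   # point on the axis, relevant for rotational motion

@dataclass
class FunctionalityAnnotation:
    scene_id: str                        # identifier of the 3D indoor scene
    element_label: str                   # functional element type, e.g. "handle", "knob", "button"
    point_indices: list[int]             # indices into the scene point cloud covered by the element
    motion: MotionParams                 # motion parameter information for the element
    task_descriptions: list[str] = field(default_factory=list)  # natural language tasks

# Example: a cabinet handle whose door rotates about a vertical hinge axis.
example = FunctionalityAnnotation(
    scene_id="scene_0001",
    element_label="handle",
    point_indices=[10234, 10235, 10299],
    motion=MotionParams(
        motion_type="rotational",
        axis=(0.0, 0.0, 1.0),
        origin=(1.2, 0.4, 0.0),
    ),
    task_descriptions=["Open the cabinet under the sink."],
)

A record like this supports all three proposed tasks: the point indices define the functionality segmentation target, the task descriptions drive affordance grounding, and the motion fields supervise 3D motion estimation.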

BibTeX

@inproceedings{delitzas2024scenefun3d, 
  title = {{SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes}}, 
  author = {Delitzas, Alexandros and Takmaz, Ayca and Tombari, Federico and Sumner, Robert and Pollefeys, Marc and Engelmann, Francis}, 
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2024}
}