3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations

Mihir Prabhudesai, Shamit Lal, Hsiao-Yu Fish Tung, Adam W. Harley, Shubhankar Potdar, Katerina Fragkiadaki; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 388-389

Abstract


We present a framework that learns 3D object concepts without supervision from 3D annotations. Our model detects objects, quantizes their features into prototypes, infers associations across detected objects in different scenes, and uses those associations to (self-)supervise its visual feature representations. Object detection, correspondence inference, representation learning, and object-to-prototype compression take place in a 3-dimensional visual feature space, inferred from the input RGB-D images using differentiable inverse graphics architectures optimized end-to-end for predicting views of scenes. Our 3D feature space learns to be invariant to the camera viewpoint and disentangled from projection artifacts, foreshortening, and cross-object occlusions. As a result, the 3D features learn to establish accurate correspondences across objects found under varying camera viewpoints, sizes, and poses, and to compress them into prototypes. Our prototypes are likewise represented as 3-dimensional feature maps; they are rotated and scaled appropriately during matching to explain object instances in a variety of 3D poses and scales. We show that this pose and scale equivariance permits much better compression of objects into their prototypical representations. Our model is optimized with a mix of end-to-end gradient descent and expectation-maximization iterations. We show that 3D object detection, correspondence inference, and object-to-prototype clustering improve over time and help one another. We demonstrate the usefulness of our model in few-shot learning: one or a few object labels suffice to learn a pose-aware 3D object detector for the object category. To the best of our knowledge, this is the first system to demonstrate that 3D visual concepts can emerge without language annotations, but rather by moving around and relating episodic visual experiences, in a self-paced, automated learning process.
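
The following is a minimal, illustrative sketch (not the authors' released code) of the pose-aware object-to-prototype matching described in the abstract: a detected object's 3D feature map is compared against each prototype under a small set of candidate rotations, and the best-aligned prototype is selected by cosine similarity. The function names, tensor shapes, and discrete yaw set are assumptions made for illustration; the full model additionally handles scale and refines the prototypes with expectation-maximization.

# Sketch of rotation-aware matching of a 3D object feature map to 3D prototypes.
# Assumes feature maps of shape (C, D, H, W); rotations are searched over a
# discrete set of yaw angles, which is an illustrative simplification.
import numpy as np
from scipy.ndimage import rotate

def cosine_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_to_prototypes(obj_feat, prototypes,
                        yaw_angles=(0, 45, 90, 135, 180, 225, 270, 315)):
    """obj_feat: (C, D, H, W) feature map of a detected object.
    prototypes: list of (C, D, H, W) prototype feature maps.
    Returns (best prototype index, best yaw angle, best similarity)."""
    best = (-1, 0.0, -np.inf)
    for k, proto in enumerate(prototypes):
        for yaw in yaw_angles:
            # Rotate the prototype in the plane of the last two axes
            # (an assumed ground-plane yaw) before comparing.
            rot = rotate(proto, angle=yaw, axes=(2, 3), reshape=False, order=1)
            s = cosine_sim(obj_feat, rot)
            if s > best[2]:
                best = (k, yaw, s)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obj = rng.standard_normal((8, 16, 16, 16)).astype(np.float32)
    protos = [rng.standard_normal((8, 16, 16, 16)).astype(np.float32) for _ in range(4)]
    k, yaw, sim = match_to_prototypes(obj, protos)
    print(f"best prototype {k} at yaw {yaw} deg (cosine {sim:.3f})")

Searching over a discrete rotation set is the simplest way to realize the rotation-aware matching the abstract describes; a continuous alignment (e.g., gradient-based pose refinement) could serve the same role.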

Related Material


[bibtex]
@InProceedings{Prabhudesai_2020_CVPR_Workshops,
author = {Prabhudesai, Mihir and Lal, Shamit and Tung, Hsiao-Yu Fish and Harley, Adam W. and Potdar, Shubhankar and Fragkiadaki, Katerina},
title = {3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}