- [pdf] [supp]
Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans
We introduce the new task of dense captioning in RGB-D scans. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detecting and describing problem at the same time, we propose Scan2Cap, an end-to-end trained architecture, to detect objects in the input scene and generate the descriptions for all of them in natural language. We apply an attention-based captioning method to generate descriptive tokens while referring to the related components in the local context. To better handle the relative spatial relations between objects, a message passing graph module is applied to learn the relation features, which are later used in the captioning phase. On the recently proposed ScanRefer dataset, we show that our architecture can effectively localize and describe the 3D objects in the scene. It also outperforms the 2D-based methods on the 3D dense captioning task by a big margin.