A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection

Zhang, Dingyuan; Liang, Dingkang; Zou, Zhikang; Li, Jingyu; Ye, Xiaoqing; Liu, Zhe; Tan, Xiao; Bai, Xiang

Dingyuan Zhang, Dingkang Liang, Zhikang Zou, Jingyu Li, Xiaoqing Ye, Zhe Liu, Xiao Tan, Xiang Bai; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 8373-8383

Abstract

Advanced 3D object detection methods usually rely on large-scale, elaborately labeled datasets to achieve good performance. However, labeling the bounding boxes for the 3D objects is difficult and expensive. Although semi-supervised (SS3D) and weakly-supervised 3D object detection (WS3D) methods can effectively reduce the annotation cost, they suffer from two limitations: 1) their performance is far inferior to the fully-supervised counterparts; 2) they are difficult to adapt to different detectors or scenes (e.g, indoor or outdoor). In this paper, we study weakly semi-supervised 3D object detection (WSS3D) with point annotations, where the dataset comprises a small number of fully labeled and massive weakly labeled data with a single point annotated for each 3D object. To fully exploit the point annotations, we employ the plain and non-hierarchical vision transformer to form a point-to-box converter, termed ViT-WSS3D. By modeling global interactions between LiDAR points and corresponding weak labels, our ViT-WSS3D can generate high-quality pseudo-bounding boxes, which are then used to train any 3D detectors without exhaustive tuning. Extensive experiments on indoor and outdoor datasets (SUN RGBD and KITTI) show the effectiveness of our method. In particular, when only using 10% fully labeled and the rest as point labeled data, our ViT-WSS3D can enable most detectors to achieve similar performance with the oracle model using 100% fully labeled data.

Related Material

[pdf] [supp]

[bibtex]

@InProceedings{Zhang_2023_ICCV, author = {Zhang, Dingyuan and Liang, Dingkang and Zou, Zhikang and Li, Jingyu and Ye, Xiaoqing and Liu, Zhe and Tan, Xiao and Bai, Xiang}, title = {A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2023}, pages = {8373-8383} }