PV-Cap: 3D Dynamic Scene Understanding Through Open Physics-based Vocabulary
Abstract
Large Vision-Language (VL) models, e.g., CLIP, have recently demonstrated impressive capabilities when trained solely on internet-scale image-text pairs. However, almost all VL models have targeted indoor objects under controlled illumination and camera views, whereas outdoor 3D environments are time-varying, uncontrolled scenes subject to natural phenomena. Captions for such unseen scenes and objects are therefore hard to obtain with a state-of-the-art (SOTA) one-shot algorithm, resulting in insufficient captions. This paper proposes PV-Cap (Physics-based Vocabulary for Caption) to enhance 3D scene understanding through enriched captions. Since understanding 3D dynamic scenes involves many tasks that are hard to handle at once, PV-Cap disentangles these complexities step by step through multiple grouped Deep Learning and Vision-Language models. The proposed i-VQA (iterative VQA) and 3D-CPP (3D Contrastive Physical-Scale Pretraining), extended from the SOTA 2D CLIP, also contribute to generating physics- and 3D-based captions. Experiments on many images of 3D dynamic events, e.g., road scenes with traffic flow and accidents, demonstrate the usability and effectiveness of the proposed PV-Cap over SOTA models in terms of segmentation and captioning.
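
As a rough illustration of the kind of objective 3D-CPP extends, the sketch below pairs the standard CLIP symmetric contrastive (InfoNCE) loss with a hypothetical physical-scale regression term. The scale head, the log-space formulation, the loss weighting, and all dimensions are assumptions for illustration only; the abstract does not specify the paper's actual formulation.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric InfoNCE loss over matched image-text pairs (standard CLIP)."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)  # i-th image matches i-th text
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def physical_scale_loss(scale_pred, scale_true):
        """Hypothetical auxiliary term: regress a physical scale (e.g., meters)
        in log space so large outdoor scenes and small objects stay comparable."""
        return F.mse_loss(torch.log1p(scale_pred), torch.log1p(scale_true))

    # Toy batch: 4 image/text embedding pairs plus per-sample physical scales.
    B, D = 4, 512
    img_emb = torch.randn(B, D)
    txt_emb = torch.randn(B, D)
    scale_pred = torch.rand(B) * 100.0   # predicted scale in meters (assumed)
    scale_true = torch.rand(B) * 100.0   # ground-truth scale (assumed)

    loss = (clip_contrastive_loss(img_emb, txt_emb)
            + 0.1 * physical_scale_loss(scale_pred, scale_true))  # 0.1 weight is an assumption
    print(f"total loss: {loss.item():.4f}")

The contrastive term alone recovers plain 2D CLIP pretraining; the auxiliary scale term is one plausible way a "physical-scale" signal could be injected, not the paper's confirmed design.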
Related Material

@InProceedings{Sakaino_2024_CVPR,
    author    = {Sakaino, Hidetomo and Phuong, Thao Nguyen and Duy, Vinh Nguyen},
    title     = {PV-Cap: 3D Dynamic Scene Understanding Through Open Physics-based Vocabulary},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7932-7942}
}