@InProceedings{Zhu_2025_ICCV,
    author    = {Zhu, Wenxuan and Li, Bing and Zheng, Cheng and Mai, Jinjie and Chen, Jun and Jiang, Letian and Hamdi, Abdullah and Martinez, Sara Rojas and Lin, Chia-Wen and Elhoseiny, Mohamed and Ghanem, Bernard},
    title     = {4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {21129-21143}
}
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly available standardized benchmarks for assessing the abilities of MLLMs to understand 4D objects. In this paper, we introduce 4D-Bench, the first benchmark for evaluating the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. Unlike existing 2D image/video-based benchmarks, 4D-Bench provides 4D objects spanning diverse categories, high-quality annotations, and tasks that necessitate multi-view spatial-temporal understanding. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results of the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even on simple single-object videos, MLLMs perform poorly, with the state-of-the-art GPT-4o achieving only 63% accuracy against a human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.