DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

Hou, Jiawei; Zhang, Shenghao; Wang, Can; Gu, Zheng; Ling, Yonggen; Zeng, Taiping; Xue, Xiangyang; Zhang, Jingbo

Jiawei Hou, Shenghao Zhang, Can Wang, Zheng Gu, Yonggen Ling, Taiping Zeng, Xiangyang Xue, Jingbo Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 32798-32807

Abstract

Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets with continuous, reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundation models and uses a geometry-aware spatiotemporal decoder to capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

Related Material

[bibtex]

@InProceedings{Hou_2026_CVPR,
    author    = {Hou, Jiawei and Zhang, Shenghao and Wang, Can and Gu, Zheng and Ling, Yonggen and Zeng, Taiping and Xue, Xiangyang and Zhang, Jingbo},
    title     = {DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {32798-32807}
}