Instance-Level Video Depth in Groups Beyond Occlusions

Yuan Liang, Yang Zhou, Ziming Sun, Tianyi Xiang, Guiqing Li, Shengfeng He; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 7581-7591

Abstract


Depth estimation in dynamic, multi-object scenes remains a major challenge, especially under severe occlusions. Existing monocular models, including foundation models, struggle with instance-wise depth consistency due to their reliance on global regression. We tackle this problem from two key aspects: data and methodology. First, we introduce the Group Instance Depth (GID) dataset, the first large-scale video depth dataset with instance-level annotations, featuring 101,500 frames from real-world activity scenes. GID bridges the gap between synthetic and real-world depth data by providing high-fidelity depth supervision for multi-object interactions. Second, we propose InstanceDepth, the first occlusion-aware depth estimation framework for multi-object environments. Our two-stage pipeline consists of (1) Holistic Depth Initialization, which assigns a coarse scene-level depth structure, and (2) Instance-Aware Depth Rectification, which refines instance-wise depth using object masks, shape priors, and spatial relationships. By enforcing geometric consistency across occlusions, our method sets a new state-of-the-art on the GID dataset and multiple benchmarks.
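The two-stage pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the gradient-based coarse depth stands in for any monocular depth model, and the per-instance median shift is a simplified, hypothetical form of rectification from masks and priors, shown only to make the coarse-then-refine structure concrete.

```python
import numpy as np

def holistic_depth_init(frame):
    """Stage 1 (illustrative stand-in): coarse scene-level depth.

    A smooth vertical depth gradient substitutes for the output of an
    off-the-shelf monocular depth model applied to the input frame.
    """
    h, w = frame.shape[:2]
    return np.tile(np.linspace(1.0, 10.0, h)[:, None], (1, w))

def instance_aware_rectification(depth, masks, depth_priors):
    """Stage 2 (illustrative stand-in): per-instance refinement.

    For each object mask, shift the instance's depth so its median
    matches a prior value, mimicking how instance-wise constraints can
    restore consistency where occlusions corrupt the coarse estimate.
    """
    refined = depth.copy()
    for mask, prior in zip(masks, depth_priors):
        observed = np.median(depth[mask])   # robust per-instance depth
        refined[mask] += prior - observed   # align instance to its prior
    return refined

# Toy usage: one 4x4 frame with a single 2x2 instance mask.
frame = np.zeros((4, 4, 3))
coarse = holistic_depth_init(frame)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
refined = instance_aware_rectification(coarse, [mask], [5.0])
```

After rectification, the masked instance's median depth equals its prior while the rest of the scene-level structure is untouched; the real method additionally exploits shape priors and inter-object spatial relationships, which this sketch omits.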

Related Material


[pdf]
[bibtex]
@InProceedings{Liang_2025_ICCV,
    author    = {Liang, Yuan and Zhou, Yang and Sun, Ziming and Xiang, Tianyi and Li, Guiqing and He, Shengfeng},
    title     = {Instance-Level Video Depth in Groups Beyond Occlusions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {7581-7591}
}