CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching

Li, Jiaqi; Wang, Yiran; Zheng, Jinghong; Zhang, Junrui; Shen, Liao; Liu, Tianqi; Cao, Zhiguo

Jiaqi Li, Yiran Wang, Jinghong Zheng, Junrui Zhang, Liao Shen, Tianqi Liu, Zhiguo Cao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 7222-7232

Abstract

Depth estimation is a fundamental task in 3D vision. An ideal depth estimation model should deliver fine detail, temporal consistency, and high efficiency. Although existing foundation models perform well in some of these respects, most fall short of meeting all three requirements simultaneously. In this paper, we present CH3Depth, an efficient and flexible depth estimation model based on flow matching, to address this challenge. Specifically, 1) we reframe the optimization objective of flow matching as Inversion by Direct Iteration (InDI) to improve accuracy. 2) To enhance efficiency, we propose non-uniform sampling, which achieves better predictions with fewer sampling steps. 3) We design the Latent Temporal Stabilizer (LTS) to enhance temporal consistency by aggregating the latent codes of adjacent frames, which keeps our method lightweight and compatible with video depth estimation. CH3Depth achieves state-of-the-art performance in zero-shot evaluations across multiple image and video datasets, excelling in prediction accuracy, efficiency, and temporal consistency, highlighting its potential as the next foundation model for depth estimation.
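
As a rough illustration of how an InDI-style iteration can be combined with a non-uniform time grid, the sketch below may help. It is not the authors' implementation: the power-law schedule, the function names `nonuniform_timesteps` and `indi_sample`, and the `predictor` callable standing in for a trained network are all assumptions made for this example. Only the update rule x_{t-d} = (d/t) F(x_t, t) + (1 - d/t) x_t follows the general InDI formulation.

```python
def nonuniform_timesteps(num_steps, power=2.0):
    """Return num_steps + 1 times decreasing from 1.0 to 0.0.

    ASSUMPTION: a power-law grid, chosen only for illustration.
    power > 1 takes larger steps near t = 1 and smaller steps near t = 0;
    power = 1 recovers a uniform grid.
    """
    return [(1.0 - i / num_steps) ** power for i in range(num_steps + 1)]


def indi_sample(x_start, predictor, times):
    """Run an InDI-style iteration over a decreasing time grid.

    x_start   : list of floats, the state at t = times[0]
    predictor : hypothetical callable (x, t) -> estimate of the clean target
    times     : strictly decreasing times ending at 0.0
    """
    x = list(x_start)
    for t, t_next in zip(times[:-1], times[1:]):
        ratio = (t - t_next) / t          # step size d divided by current t
        x_hat = predictor(x, t)           # current estimate of the target
        # Blend the estimate with the current state (InDI update).
        x = [ratio * h + (1.0 - ratio) * v for h, v in zip(x_hat, x)]
    return x


if __name__ == "__main__":
    # Toy check with an oracle predictor that always returns the target.
    target = [0.2, 0.8, 0.5]
    times = nonuniform_timesteps(4, power=2.0)
    result = indi_sample([1.0, -1.0, 0.0], lambda x, t: target, times)
    print(times)
    print(result)
```

With an oracle predictor the final step has ratio 1, so the iteration lands on the target regardless of the schedule. With a learned, imperfect predictor, the placement of the steps determines where the limited step budget is spent, which is the motivation for choosing a non-uniform grid.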

Related Material

[bibtex]

@InProceedings{Li_2025_CVPR,
    author    = {Li, Jiaqi and Wang, Yiran and Zheng, Jinghong and Zhang, Junrui and Shen, Liao and Liu, Tianqi and Cao, Zhiguo},
    title     = {CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {7222-7232}
}