Exploiting Cross-modal Cost Volume for Multi-sensor Depth Estimation
Abstract
Single-modal depth estimation has improved steadily over the years. However, relying solely on a single imaging sensor, such as an RGB or near-infrared (NIR) camera, can result in unreliable and erroneous depth estimation, particularly under challenging lighting conditions such as low light or sudden lighting changes. Therefore, several approaches have leveraged multiple sensors for robust depth estimation. However, effective fusion methods that fully utilize multi-modal sensor information still require further investigation. With this in mind, we propose a multi-modal cost volume fusion strategy with cross-modal attention that incorporates information from both cross-spectral and single-modality pairs. Our method first constructs low-level cost volumes consisting of modality-specific (i.e., single-modality) and modality-invariant (i.e., cross-spectral) volumes from multi-modal sensors. These cost volumes are then gradually fused using bidirectional cross-modal fusion and unidirectional LiDAR fusion to generate a multi-sensory cost volume. Furthermore, we introduce a straightforward domain gap reduction approach to learn modality-invariant features, along with a depth refinement technique based on cost volume-guided propagation. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance under diverse environmental changes.
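To make the fusion pipeline concrete, below is a minimal PyTorch-style sketch of how modality-specific and cross-spectral cost volumes could be fused with bidirectional cross-attention, followed by a unidirectional LiDAR injection step. The module names, tensor shapes, and the additive LiDAR fusion are illustrative assumptions for exposition only and do not reproduce the authors' implementation.

    # Minimal sketch (not the authors' code): bidirectional cross-modal fusion of
    # modality-specific and cross-spectral cost volumes via cross-attention.
    # Shapes, module names, and the LiDAR injection step are illustrative assumptions.
    import torch
    import torch.nn as nn


    class CrossModalFusion(nn.Module):
        """Bidirectionally exchanges information between two cost volumes."""

        def __init__(self, channels, num_heads=4):
            super().__init__()
            self.attn_ab = nn.MultiheadAttention(channels, num_heads, batch_first=True)
            self.attn_ba = nn.MultiheadAttention(channels, num_heads, batch_first=True)

        def forward(self, vol_a, vol_b):
            # vol_*: (B, C, D, H, W) cost volumes over D depth hypotheses.
            b, c, d, h, w = vol_a.shape
            ta = vol_a.flatten(2).transpose(1, 2)   # (B, D*H*W, C) tokens
            tb = vol_b.flatten(2).transpose(1, 2)
            # Bidirectional cross-attention: each volume queries the other.
            a2b, _ = self.attn_ab(ta, tb, tb)
            b2a, _ = self.attn_ba(tb, ta, ta)
            fused_a = (ta + a2b).transpose(1, 2).reshape(b, c, d, h, w)
            fused_b = (tb + b2a).transpose(1, 2).reshape(b, c, d, h, w)
            return fused_a, fused_b


    def fuse_with_lidar(cost_volume, lidar_volume, weight=1.0):
        # Unidirectional LiDAR fusion (assumed additive here): sparse LiDAR evidence,
        # rasterized over the depth hypotheses, biases the image-based cost volume.
        return cost_volume + weight * lidar_volume


    if __name__ == "__main__":
        B, C, D, H, W = 1, 16, 8, 12, 16
        rgb_vol = torch.randn(B, C, D, H, W)     # modality-specific (RGB pair)
        cross_vol = torch.randn(B, C, D, H, W)   # modality-invariant (cross-spectral pair)
        lidar_vol = torch.zeros(B, C, D, H, W)   # sparse LiDAR evidence (hypothetical layout)

        fusion = CrossModalFusion(C)
        rgb_fused, cross_fused = fusion(rgb_vol, cross_vol)
        multi_sensor_vol = fuse_with_lidar(rgb_fused + cross_fused, lidar_vol)
        print(multi_sensor_vol.shape)  # torch.Size([1, 16, 8, 12, 16])

In this sketch the cross-attention is applied symmetrically in both directions, mirroring the bidirectional cross-modal fusion described in the abstract, while the LiDAR volume only feeds into the image-based volume, mirroring the unidirectional LiDAR fusion.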
Related Material
[pdf]
[supp]
[bibtex]
@InProceedings{Kim_2024_ACCV,
    author    = {Kim, Janghyun and Shin, Ukcheol and Heo, Seokyong and Park, Jinsun},
    title     = {Exploiting Cross-modal Cost Volume for Multi-sensor Depth Estimation},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {1420-1436}
}