Multi-Modal Fusion of Event and RGB for Monocular Depth Estimation Using a Unified Transformer-based Architecture

Anusha Devulapally, Md Fahim Faysal Khan, Siddharth Advani, Vijaykrishnan Narayanan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2081-2089

Abstract


In the field of robotics and autonomous navigation, accurate pixel-level depth estimation has gained significant importance. Event cameras, or dynamic vision sensors, capture asynchronous changes in brightness at the pixel level, offering benefits such as high temporal resolution, no motion blur, and a wide dynamic range. However, unlike traditional cameras that measure absolute intensity, event cameras cannot provide scene context. Efficiently combining the advantages of asynchronous events and synchronous RGB images to enhance depth estimation remains a challenge. In our study, we introduce a unified transformer that combines both event and RGB modalities to achieve precise depth prediction. In contrast to separate transformers for each input modality, a unified transformer model captures inter-modal dependencies and uses self-attention to enhance event-RGB contextual interactions. This approach exceeds the performance of the recurrent neural network (RNN) methods used in state-of-the-art models. To encode the temporal information carried by events, convLSTMs are applied before the transformer, further improving depth estimation. Our proposed architecture outperforms existing approaches in absolute mean depth error, achieving state-of-the-art results in most cases. Improvements are also observed in other metrics, including RMSE, absolute relative difference, and depth-threshold accuracy. The source code is available at https://github.com/anusha-devulapally/ER-F2D.
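The key idea above, that a single shared transformer can model inter-modal dependencies that two separate per-modality transformers cannot, comes down to running self-attention over a joint token sequence. The following is a minimal NumPy sketch of that idea, not the authors' implementation: event tokens (e.g. from convLSTM-encoded event tensors) and RGB tokens (e.g. from patch embedding) are concatenated, and a single attention map then spans both modalities. All names, token counts, and dimensions here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_self_attention(tokens, d_k=16, seed=0):
    """Single-head self-attention over one joint token sequence.

    When `tokens` concatenates event and RGB tokens, every attention
    weight may connect the two modalities -- the cross-modal
    interaction a unified transformer captures.
    """
    rng = np.random.default_rng(seed)
    d = tokens.shape[-1]
    # Random projections stand in for learned Q/K/V weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N_ev+N_rgb, N_ev+N_rgb)
    return attn @ V, attn

# Hypothetical token sets: 4 event tokens and 6 RGB tokens, dim 32.
event_tokens = np.random.default_rng(1).standard_normal((4, 32))
rgb_tokens = np.random.default_rng(2).standard_normal((6, 32))
joint = np.concatenate([event_tokens, rgb_tokens], axis=0)
out, attn = unified_self_attention(joint)
# attn[:4, 4:] holds event->RGB weights: per-modality transformers
# would have no counterpart to these entries.
```

With separate encoders, the attention matrix would be block-diagonal (event-to-event and RGB-to-RGB only); concatenation fills in the off-diagonal blocks, which is where the event-RGB contextual interaction lives.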

Related Material


@InProceedings{Devulapally_2024_CVPR,
  author    = {Devulapally, Anusha and Khan, Md Fahim Faysal and Advani, Siddharth and Narayanan, Vijaykrishnan},
  title     = {Multi-Modal Fusion of Event and RGB for Monocular Depth Estimation Using a Unified Transformer-based Architecture},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {2081-2089}
}