MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models
Abstract
3D object detection and pose estimation from a single-view image is challenging due to the high uncertainty caused by the absence of 3D perception. As a solution, recent monocular 3D detection methods leverage additional modalities, such as stereo image pairs and LiDAR point clouds, to enhance image features at the expense of additional annotation costs. We propose using diffusion models to learn effective representations for monocular 3D detection without additional modalities or training data. We present MonoDiff, a novel framework that employs the reverse diffusion process to estimate the 3D bounding box and orientation. However, given the variability of bounding box sizes along different dimensions, it is ineffective to sample noise from a standard Gaussian distribution. Hence, we adopt a Gaussian mixture model to sample noise during the forward diffusion process and to initialize the reverse diffusion process. Furthermore, since the diffusion model generates the 3D parameters for a given object image, we leverage 2D detection information to provide additional supervision by maintaining the correspondence between the 3D bounding box and its 2D projection. Finally, we incorporate a dynamic weighting scheme based on the signal-to-noise ratio to account for the level of uncertainty in the projection-based supervision at different timesteps. MonoDiff outperforms current state-of-the-art monocular 3D detection methods on the KITTI and Waymo benchmarks without additional depth priors. The MonoDiff project page is available at: https://dylran.github.io/monodiff.github.io.
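To make two of the abstract's mechanisms concrete, the sketch below illustrates (a) sampling forward-diffusion noise from a Gaussian mixture instead of a standard Gaussian, and (b) a signal-to-noise-dependent weight for the projection supervision. This is a minimal PyTorch sketch under assumed conventions (a DDPM-style cumulative schedule alpha_bar; hypothetical mixture parameters pi, mu, sigma), not the authors' released implementation.

import torch

def sample_gmm_noise(shape, pi, mu, sigma):
    # Sample noise from a K-component Gaussian mixture instead of N(0, I).
    # pi: (K,) mixture weights summing to 1; mu, sigma: (K,) per-component stats.
    k = torch.multinomial(pi, num_samples=shape[0], replacement=True)  # one component per sample
    eps = torch.randn(shape)                                           # standard Gaussian base noise
    view = (shape[0],) + (1,) * (len(shape) - 1)                       # broadcast over remaining dims
    return mu[k].view(view) + sigma[k].view(view) * eps

def forward_diffuse(x0, t, alpha_bar, pi, mu, sigma):
    # q(x_t | x_0): noise the clean 3D box parameters x0 at timesteps t,
    # using GMM noise in place of the usual standard Gaussian.
    noise = sample_gmm_noise(x0.shape, pi, mu, sigma)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    return x_t, noise

def snr_weight(t, alpha_bar):
    # Dynamic weight for the projection loss: SNR(t) = alpha_bar_t / (1 - alpha_bar_t),
    # normalized to (0, 1) so that heavily noised timesteps contribute less supervision.
    snr = alpha_bar[t] / (1.0 - alpha_bar[t])
    return snr / (1.0 + snr)

In use, one would fit pi, mu, and sigma to the statistics of ground-truth box dimensions (the motivation the abstract gives for departing from a single standard Gaussian), call forward_diffuse on ground-truth parameters during training, and scale the 2D-projection consistency loss by snr_weight(t, alpha_bar); the exact mixture fitting and loss form are design choices left to the paper.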
Related Material
[pdf] [supp] [bibtex]
@InProceedings{Ranasinghe_2024_CVPR,
  author    = {Ranasinghe, Yasiru and Hegde, Deepti and Patel, Vishal M.},
  title     = {MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {10659-10670}
}