Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection

Erli Ouyang, Li Zhang, Mohan Chen, Anurag Arnab, Yanwei Fu; Proceedings of the Asian Conference on Computer Vision (ACCV), 2020

Abstract


Removing particular objects from a video and filling the resulting blank regions with a plausible background is a challenging and often ill-posed task. In this paper, we propose a framework to solve this difficult problem in complex, dynamic scenes by leveraging both multi-view geometry and convolutional neural network-based approaches. Given an input video with masks over the undesired objects, we first estimate a depth map and a relative camera pose for each input frame. We then fuse the estimated depth and pose into a global 3D scene reconstruction. By projecting the point clouds from the reconstructed grid volume, we can fill in most of the regions masked in the original input. Finally, we use learning-based approaches to inpaint the remaining pixels that could not be resolved by the 3D reconstruction. Compared with previous video inpainting approaches, our system generates superior qualitative results on the DAVIS 2016 and KITTI datasets, particularly in scenes where multiple, large objects are removed.
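The geometric filling step can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a pinhole camera with intrinsics K and 4x4 camera-to-world pose matrices, a fused point cloud already assembled as world_pts with per-point colors, and a boolean object mask per frame; backproject, fill_masked, and all parameter names are hypothetical.

import numpy as np

def backproject(depth, K, pose):
    """Lift one frame's depth map to world-space 3D points (pinhole model)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T               # per-pixel viewing rays
    pts_cam = rays * depth.reshape(-1, 1)         # scale rays by depth
    pts_hom = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_hom @ pose.T)[:, :3]              # camera-to-world transform

def fill_masked(frame, mask, K, pose, world_pts, colors):
    """Project the fused cloud into this view and paint masked pixels."""
    pts_hom = np.concatenate([world_pts, np.ones((len(world_pts), 1))], axis=1)
    pts_cam = (pts_hom @ np.linalg.inv(pose).T)[:, :3]
    front = pts_cam[:, 2] > 1e-6                  # keep points in front of the camera
    pts_cam, cols = pts_cam[front], colors[front]
    proj = pts_cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    h, w = mask.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv, cols, z = uv[ok], cols[ok], pts_cam[ok, 2]
    out, filled = frame.copy(), np.zeros((h, w), dtype=bool)
    order = np.argsort(-z)                        # draw far-to-near so closer points win
    for (x, y), c in zip(uv[order], cols[order]):
        if mask[y, x]:
            out[y, x] = c
            filled[y, x] = True
    return out, mask & ~filled                    # residual holes go to the learned inpainter

In this sketch, building world_pts itself would amount to concatenating the backproject outputs over all frames while keeping only pixels that fall outside the object masks; the paper's grid-volume fusion is not reproduced here.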

Related Material


[pdf]
[bibtex]
@InProceedings{Ouyang_2020_ACCV,
    author    = {Ouyang, Erli and Zhang, Li and Chen, Mohan and Arnab, Anurag and Fu, Yanwei},
    title     = {Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {November},
    year      = {2020}
}