Monocular 3D Localization of Vehicles in Road Scenes

Haotian Zhang, Haorui Ji, Aotian Zheng, Jenq-Neng Hwang, Ren-Hung Hwang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 2855-2864


Sensing and perception systems for autonomous driving vehicles in road scenes are composed of three crucial components: 3D-based object detection, tracking, and localization. While all three components are important, most relevant papers tend to only focus on one single component. We propose a monocular vision-based framework for 3D-based detection, tracking, and localization by effectively integrating all three tasks in a complementary manner. Our system contains an RCNN-based Localization Network (LOCNet), which works in concert with fitness evaluation score (FES) based single-frame optimization, to get more accurate and refined 3D vehicle localization. To better utilize the temporal information, we further use a multi-frame optimization technique, taking advantage of camera ego-motion and a 3D TrackletNet Tracker (3D TNT), to improve both accuracy and consistency in our 3D localization results. Our system outperforms state-of-the-art image-based solutions in diverse scenarios and is even comparable with LiDAR-based methods.

Related Material

@InProceedings{Zhang_2021_ICCV, author = {Zhang, Haotian and Ji, Haorui and Zheng, Aotian and Hwang, Jenq-Neng and Hwang, Ren-Hung}, title = {Monocular 3D Localization of Vehicles in Road Scenes}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops}, month = {October}, year = {2021}, pages = {2855-2864} }