TMVNet: Using Transformers for Multi-View Voxel-Based 3D Reconstruction

Kebin Peng, Rifatul Islam, John Quarles, Kevin Desai; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 222-230

Abstract


Previous research in multi-view 3D reconstruction has used different convolutional neural network (CNN) architectures to obtain a 3D voxel representation. Even though CNNs work well, they have limitations in exploiting long-range dependencies in sequence transduction tasks such as multi-view 3D reconstruction. In this paper, we propose TMVNet, a two-layer transformer encoder that can better exploit long-range dependency information. In contrast to the 2D CNN decoder used by previous approaches, our model uses a 3D CNN encoder to capture the relations between voxels in 3D space. In addition, our proposed 3D feature fusion network aggregates the 3D position features from the CNN with the long-range dependency features from the transformer. The proposed TMVNet is trained and tested on the ShapeNet dataset. Comparisons against ten state-of-the-art multi-view 3D reconstruction methods, together with the reported quantitative and qualitative results, showcase the superiority of our method.
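
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the overall flow: per-view CNN features feed a two-layer transformer encoder that models cross-view long-range dependencies, and a 3D feature fusion step combines the transformer output with 3D CNN features to predict a voxel occupancy grid. All layer sizes, the coarse-volume input, and names such as TMVNetSketch are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TMVNetSketch(nn.Module):
    # Illustrative only: layer sizes and the coarse-volume input are
    # assumptions, not the paper's actual architecture.
    def __init__(self, feat_dim=128, vox=16):
        super().__init__()
        # Per-view 2D feature extractor (stand-in backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Two-layer transformer encoder over the sequence of view tokens,
        # capturing long-range cross-view dependencies via self-attention.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # 3D CNN branch over a coarse voxel feature volume (3D position cues).
        self.cnn3d = nn.Conv3d(1, 8, 3, padding=1)
        # Feature fusion: concatenate both branches per voxel, then predict
        # occupancy logits with a 1x1x1 convolution.
        self.fuse = nn.Conv3d(feat_dim + 8, 1, 1)
        self.vox = vox

    def forward(self, views, coarse_vol):
        # views: (B, N, 3, H, W); coarse_vol: (B, 1, V, V, V)
        b, n = views.shape[:2]
        tokens = self.backbone(views.flatten(0, 1)).flatten(1).view(b, n, -1)
        tokens = self.transformer(tokens)        # cross-view attention
        g = tokens.mean(dim=1)                   # aggregate over views
        g = g[:, :, None, None, None].expand(-1, -1,
                                             self.vox, self.vox, self.vox)
        vol = self.cnn3d(coarse_vol)
        logits = self.fuse(torch.cat([g, vol], dim=1))
        return torch.sigmoid(logits.squeeze(1))  # (B, V, V, V) occupancy

model = TMVNetSketch()
pred = model(torch.randn(2, 5, 3, 64, 64), torch.randn(2, 1, 16, 16, 16))
print(pred.shape)  # torch.Size([2, 16, 16, 16])

The 1x1x1 fusion convolution is one plausible way to aggregate per-voxel CNN features with the broadcast transformer feature; the paper's 3D feature fusion network may differ.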

Related Material


@InProceedings{Peng_2022_CVPR,
  author    = {Peng, Kebin and Islam, Rifatul and Quarles, John and Desai, Kevin},
  title     = {TMVNet: Using Transformers for Multi-View Voxel-Based 3D Reconstruction},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2022},
  pages     = {222-230}
}