TMVNet: Using Transformers for Multi-View Voxel-Based 3D Reconstruction
Previous research in multi-view 3D reconstruction has used various convolutional neural network (CNN) architectures to obtain a 3D voxel representation. Although CNNs work well, they are limited in exploiting long-range dependencies in sequence transduction tasks such as multi-view 3D reconstruction. In this paper, we propose TMVNet -- a two-layer transformer encoder that makes better use of long-range dependency information. In contrast to the 2D CNN decoders used by previous approaches, our model uses a 3D CNN decoder to capture the relations between voxels in 3D space. In addition, our proposed 3D feature fusion network aggregates 3D positional features from the CNN with long-range dependency features from the transformer. The proposed TMVNet is trained and tested on the ShapeNet dataset. Quantitative and qualitative comparisons against ten state-of-the-art multi-view 3D reconstruction methods showcase the superiority of our method.
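The pipeline described above -- per-view CNN features, a two-layer transformer encoder attending across views, and a 3D CNN decoder producing a voxel grid -- can be sketched in PyTorch. This is a minimal illustrative sketch only: the backbone, layer sizes, head count, and 32^3 output resolution are assumptions for demonstration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TMVNetSketch(nn.Module):
    """Hypothetical sketch of the described pipeline (not the authors' code)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Per-view 2D feature extractor (stand-in for the paper's backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim),
        )
        # Two-layer transformer encoder: attends across the view sequence
        # to model long-range dependencies between views
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # 3D CNN decoder: upsample a seed volume to a 32^3 occupancy grid,
        # capturing relations between voxels in 3D space
        self.to_seed = nn.Linear(feat_dim, 64 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, views):  # views: (batch, num_views, 3, H, W)
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1)).view(b, v, -1)
        fused = self.transformer(feats).mean(dim=1)   # aggregate over views
        seed = self.to_seed(fused).view(b, 64, 4, 4, 4)
        return self.decoder(seed).squeeze(1)          # (batch, 32, 32, 32)

voxels = TMVNetSketch()(torch.randn(2, 3, 3, 64, 64))
print(voxels.shape)  # torch.Size([2, 32, 32, 32])
```

Mean-pooling over views here is only a placeholder for the paper's 3D feature fusion network, which combines CNN positional features with the transformer's long-range dependency features.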