OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition
Current multi-view methods for 3D object recognition lose the correlation between views and render 3D objects with redundant views. This makes it difficult to improve recognition performance and unnecessarily increases the computational cost and running time of the network. Recognition performance suffers further when computing resources are limited. We propose the optimal viewset pooling transformer (OVPT), a method for efficient and accurate 3D object recognition. OVPT constructs an optimal viewset based on information entropy to reduce the redundancy of the multi-view scheme. A convolutional neural network (CNN) extracts multi-view low-level local features from the optimal viewset. A class token is embedded at the head of these features and combined with a position encoding to generate a local-view token sequence. This sequence is trained in parallel with a pooling transformer to generate a local-view information token sequence, while the global class token captures the global feature information of the local-view token sequence. The two are then aggregated into a single compact 3D global feature descriptor. On two public benchmarks, ModelNet10 and ModelNet40, our method needs only a small optimal viewset for each 3D object and achieves an overall recognition accuracy (OA) of 99.33% and 97.48%, respectively. Compared with other deep learning methods, our method achieves state-of-the-art performance even with limited computational resources. Our source code is available at https://github.com/shepherds001/OVPT.
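To illustrate the entropy-based viewset construction, the sketch below shows one plausible reading of that step; it is not the authors' implementation. It assumes each rendered view is available as a flat list of integer gray levels, scores a view by the Shannon entropy of its gray-level distribution, and keeps the k highest-scoring views. The names `shannon_entropy` and `select_viewset`, the scoring rule, and the toy data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits of a flat sequence of integer gray levels."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_viewset(views, k):
    """Return the indices of the k highest-entropy views, in original order.

    Assumption: higher entropy means a more informative view. The abstract
    only states that the viewset is built from information entropy.
    """
    ranked = sorted(
        range(len(views)),
        key=lambda i: shannon_entropy(views[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy data: three 4-pixel "views" with increasing gray-level diversity.
views = [
    [0, 0, 0, 0],        # one gray level
    [0, 0, 255, 255],    # two equally likely gray levels
    [0, 85, 170, 255],   # four equally likely gray levels
]
best = select_viewset(views, k=2)
print(best)
```

In a full pipeline, the selected views would then be passed to the CNN feature extractor; only the selection step is sketched here.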