@InProceedings{Kweon_2024_ACCV,
  author    = {Kweon, Minseong and Park, Jinsun},
  title     = {ULTRON: Unifying Local Transformer and Convolution for Large-scale Image Retrieval},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {4000-4016}
}
ULTRON: Unifying Local Transformer and Convolution for Large-scale Image Retrieval
Abstract
In large-scale image retrieval, the primary goal is to extract discriminative features and embed them into global image representations. Previous CNN-based methods effectively learn local features and create robust representations, leading to strong performance. Transformers, by contrast, excel at learning global context but often struggle to extract fine details and therefore perform poorly in large-scale landmark recognition. In this paper, we propose a novel hybrid architecture named ULTRON, which combines transformer blocks with local self-attention and a convolution-based encoder. Our local transformer block contains an advanced self-attention mechanism that enhances the spatial context awareness of key features and updates the value features by considering broader information within fixed-size regional windows. In addition, we design a channel-wise dilated convolution that adjusts the dilation rate per channel, enabling effective multiscale feature learning while robustly capturing local features. We focus on learning local contexts throughout the entire network and effectively blending these contexts in the attention-based pooling process. This approach generates a powerful global representation that includes local information, relying solely on classification loss without requiring additional modules to capture local features. Experimental results demonstrate that our model outperforms previous works due to the effective integration of local and global information.
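To make the channel-wise dilation idea concrete, the following is a minimal, self-contained sketch of a depthwise convolution in which each channel is assigned its own dilation rate, so different channels observe different receptive-field sizes with the same kernel length. It is shown in 1-D with plain Python lists for clarity; the function name, the 1-D setting, and the zero-padding scheme are illustrative assumptions, not the paper's actual implementation.

```python
def channelwise_dilated_conv1d(x, weights, dilations):
    """Depthwise 1-D convolution with a per-channel dilation rate.

    x         : list of channels, each a list of floats (length L)
    weights   : per-channel kernels, each a list of taps (odd length)
    dilations : per-channel dilation rates (ints >= 1)

    Returns a same-length output per channel, using implicit zero padding.
    Names and structure are illustrative, not the paper's implementation.
    """
    out = []
    for signal, kernel, d in zip(x, weights, dilations):
        L, K = len(signal), len(kernel)
        y = []
        for i in range(L):
            acc = 0.0
            for k in range(K):
                # Tap offsets are spaced d apart: a larger dilation
                # widens the receptive field without adding parameters.
                j = i + (k - K // 2) * d
                if 0 <= j < L:
                    acc += kernel[k] * signal[j]
            y.append(acc)
        out.append(y)
    return out
```

With a 3-tap averaging kernel, a channel with dilation 1 aggregates immediate neighbors, while a channel with dilation 2 aggregates positions two steps away, so a single layer mixes fine and coarse scales across channels.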
Related Material