Fine-Tuning Image Transformers Using Learnable Memory

Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12155-12164


In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks using only a small number of additional parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these 'memory tokens'. We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables models to preserve their previous capabilities while extending them to new downstream tasks. This approach, which we call 'non-destructive fine-tuning', enables computation reuse across multiple tasks while still learning new tasks independently.
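To make the idea concrete, here is a minimal NumPy sketch of a single self-attention step with memory tokens, assuming the simplest variant in which the learnable memory vectors are appended to the keys and values so that input tokens can attend to them, while the memory tokens themselves produce no propagated outputs. All function and variable names are illustrative and not taken from the paper's code; real implementations would use multi-head attention and learned per-layer memories.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(x, memory, w_q, w_k, w_v):
    """Single-head self-attention where inputs also attend to memory tokens.

    x:      (n, d) input tokens for this layer
    memory: (m, d) learnable memory tokens (per-layer parameters)
    """
    kv_in = np.concatenate([x, memory], axis=0)   # keys/values include memory
    q = x @ w_q                                   # queries come only from inputs
    k = kv_in @ w_k
    v = kv_in @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Output has one row per input token; memory tokens are attended to
    # but their own outputs are not propagated to the next layer.
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d, n, m = 8, 4, 2                                 # embed dim, inputs, memory tokens
x = rng.normal(size=(n, d))
memory = rng.normal(size=(m, d))                  # would be trained per layer/task
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = attention_with_memory(x, memory, w_q, w_k, w_v)
print(out.shape)  # (4, 8): same number of tokens in as out
```

Because only the memory vectors (and typically the classification head) are trained, the number of task-specific parameters grows with `m * d` per layer rather than with the full model size.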

Related Material

[pdf] [supp] [arXiv]
@InProceedings{Sandler_2022_CVPR,
    author    = {Sandler, Mark and Zhmoginov, Andrey and Vladymyrov, Max and Jackson, Andrew},
    title     = {Fine-Tuning Image Transformers Using Learnable Memory},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {12155-12164}
}