BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Yiran Song, Qianyu Zhou, Xiangtai Li, Deng-Ping Fan, Xuequan Lu, Lizhuang Ma; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3162-3173

Abstract


In this paper we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM known for its zero-shot generalizability exhibits a performance degradation when faced with datasets with varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications hindering the preservation of SAM's rich prior knowledge. Besides such task-specific tuning necessitates a complete retraining of the model which is cost-expensive and unacceptable for deployment in the downstream tasks. In this paper we reformulate this challenge as a length extrapolation problem where token sequence length varies while maintaining a consistent patch size for images with different sizes. To this end we propose a Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications. Firstly we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes. Secondly we present a bias-mode attention mask that allows each token to prioritize neighboring information mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation of diverse datasets including DIS5K DUTS ISIC COD10K and COCO reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore we propose a generalized model and benchmark showcasing BA-SAM's generalizability across all four datasets simultaneously.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Song_2024_CVPR, author = {Song, Yiran and Zhou, Qianyu and Li, Xiangtai and Fan, Deng-Ping and Lu, Xuequan and Ma, Lizhuang}, title = {BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {3162-3173} }