FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

You Huang, Zongyu Lan, Liujuan Cao, Xianming Lin, Shengchuan Zhang, Guannan Jiang, Rongrong Ji; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3120-3130

Abstract


The Segment Anything Model (SAM) marks a notable milestone in segmentation models, highlighted by its robust zero-shot capabilities and its ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder, ensuring efficient real-time performance. However, SAM faces stability issues on challenging samples under this pipeline. These issues arise from two main factors. First, the image preprocessing prevents SAM from dynamically applying image-level zoom-in strategies to refocus on the target object during interaction. Second, the lightweight decoder struggles to sufficiently integrate interactive information with the image embeddings. To address these two limitations, we propose FocSAM, with a pipeline redesigned in two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. Dwin-MSA localizes attention computations around the target object, enhancing object-related embeddings with minimal computational overhead. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from the first few clicks, which have a significant impact on the overall segmentation result. Experimentally, FocSAM augments SAM's interactive segmentation performance to match the existing state-of-the-art method in segmentation quality, while requiring only about 5.6% of that method's inference time on CPUs. Code is available at https://github.com/YouHuang67/focsam.
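To make the Dwin-MSA idea concrete, below is a minimal PyTorch-style sketch of window-localized self-attention: attention is computed only inside a window centered on the target object, and the refined tokens are written back into the full embedding map. The class name `DwinMSASketch`, the fixed window size, and the way the window center is supplied are illustrative assumptions based on the abstract, not the paper's actual implementation (see the released code at the URL above for that).

```python
import torch
import torch.nn as nn


class DwinMSASketch(nn.Module):
    """Sketch of attention localized around a target object (assumption:
    a single square window whose center comes from the user's clicks
    or the previous mask; the paper's Dwin-MSA may differ in detail)."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 16):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, center: tuple) -> torch.Tensor:
        # feats: (B, H, W, C) image embeddings; center: (cy, cx) of the object.
        B, H, W, C = feats.shape
        half = self.window // 2
        cy, cx = center
        # Clamp the window so it stays inside the feature map.
        y0 = max(0, min(cy - half, H - self.window))
        x0 = max(0, min(cx - half, W - self.window))
        win = feats[:, y0:y0 + self.window, x0:x0 + self.window, :]
        wh, ww = win.shape[1], win.shape[2]
        tokens = win.reshape(B, wh * ww, C)
        # Standard multi-head self-attention, but only over window tokens,
        # so the extra computation is small and object-focused.
        out, _ = self.attn(tokens, tokens, tokens)
        refined = feats.clone()
        refined[:, y0:y0 + wh, x0:x0 + ww, :] = out.reshape(B, wh, ww, C)
        return refined
```

The design point this illustrates is the cost argument in the abstract: attention over a fixed-size window is independent of image resolution, which is how object-related embeddings can be enhanced "with minimal computational overhead".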
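For P-DyReLU, the abstract only states that the activation integrates interactive information pixel-wise. A minimal sketch, assuming the max-of-affine form of the original Dynamic ReLU (y = max(a1*x + b1, a2*x + b2)) with coefficients predicted per pixel from click-derived features, is given below; the coefficient head (`self.coef`) and tensor shapes are hypothetical, not the paper's exact design.

```python
import torch
import torch.nn as nn


class PDyReLUSketch(nn.Module):
    """Sketch of a pixel-wise dynamic ReLU: a 1x1 conv maps prompt
    (click) features to per-pixel activation coefficients, so the
    nonlinearity itself injects interactive information everywhere."""

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical head: 4 coefficients (a1, b1, a2, b2) per channel.
        self.coef = nn.Conv2d(dim, 4 * dim, kernel_size=1)

    def forward(self, x: torch.Tensor, prompt_feats: torch.Tensor) -> torch.Tensor:
        # x, prompt_feats: (B, C, H, W); prompt_feats encodes the clicks.
        a1, b1, a2, b2 = self.coef(prompt_feats).chunk(4, dim=1)
        # Center the first slope at 1 so the layer starts near identity,
        # then take the elementwise max of the two affine branches.
        y1 = (1.0 + a1) * x + b1
        y2 = a2 * x + b2
        return torch.maximum(y1, y2)
```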

Related Material


BibTeX
@InProceedings{Huang_2024_CVPR,
    author    = {Huang, You and Lan, Zongyu and Cao, Liujuan and Lin, Xianming and Zhang, Shengchuan and Jiang, Guannan and Ji, Rongrong},
    title     = {FocSAM: Delving Deeply into Focused Objects in Segmenting Anything},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {3120-3130}
}