SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 22290-22300

Abstract


Click-based interactive image segmentation aims to extract objects with a limited number of user clicks. A hierarchical backbone is the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to serve as a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive segmentation. To fill this gap, we propose SimpleClick, the first plain-backbone method for interactive segmentation. Beyond the plain backbone, we also explore several variants of simple feature pyramid networks that take only the last feature representation of the backbone as input. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, a 21.8% improvement over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. We further develop an extremely tiny ViT backbone for SimpleClick and provide a detailed computational analysis, highlighting its suitability as a practical annotation tool.
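The NoC@90 figure quoted in the abstract is a "number of clicks" metric: the average number of clicks needed before the predicted mask reaches 90% IoU with the ground truth, so lower is better. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code. The function names are hypothetical, and the cap of 20 clicks per object is an assumption based on common practice in the interactive segmentation literature; the abstract does not state it.

```python
from typing import Sequence


def noc_at_threshold(
    ious_per_click: Sequence[float],
    iou_threshold: float = 0.90,
    max_clicks: int = 20,  # assumed click budget; not stated in the abstract
) -> int:
    """Return the 1-based index of the first click whose IoU reaches the threshold.

    ious_per_click[i] is the IoU of the predicted mask after click i + 1.
    If the threshold is never reached within the budget, return max_clicks.
    """
    for n_clicks, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= iou_threshold:
            return n_clicks
    return max_clicks


def mean_noc(
    all_ious: Sequence[Sequence[float]],
    iou_threshold: float = 0.90,
    max_clicks: int = 20,
) -> float:
    """Average the per-object click counts over a dataset (lower is better)."""
    counts = [noc_at_threshold(ious, iou_threshold, max_clicks) for ious in all_ious]
    return sum(counts) / len(counts)
```

Under these assumptions, an object whose IoU trace is `[0.5, 0.8, 0.92]` contributes 3 clicks, and an object that never reaches the threshold contributes the full budget.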

Related Material


@InProceedings{Liu_2023_ICCV,
    author    = {Liu, Qin and Xu, Zhenlin and Bertasius, Gedas and Niethammer, Marc},
    title     = {SimpleClick: Interactive Image Segmentation with Simple Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22290-22300}
}