All in Tokens: Unifying Output Space of Visual Tasks via Soft Token

Jia Ning, Chen Li, Zheng Zhang, Chunyu Wang, Zigang Geng, Qi Dai, Kun He, Han Hu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 19900-19910

Abstract


We introduce AiT, a unified output representation for various vision tasks, which is a crucial step towards general-purpose vision task solvers. Despite the challenges posed by high-dimensional and task-specific outputs, we showcase the potential of using a discrete representation (VQ-VAE) to model the dense outputs of many computer vision tasks as a sequence of discrete tokens. This is inspired by the established ability of VQ-VAE to preserve structures spanning multiple pixels using a few discrete codes. To this end, we present a modified, shallower VQ-VAE architecture that improves efficiency while maintaining prediction accuracy. Our approach also incorporates uncertainty into the decoding process through a soft fusion of the codebook entries, yielding more stable training and notably improved prediction accuracy. Our evaluation of AiT on depth estimation and instance segmentation, tasks with continuous and discrete labels respectively, demonstrates its superiority over other unified models. The code and models are available at https://github.com/SwinTransformer/AiT.
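The "soft fusion of the codebook entries" mentioned above can be illustrated with a minimal sketch: instead of a hard argmax lookup into the codebook, the predicted token probabilities weight a mixture of codebook embeddings before decoding. This is an assumption-laden toy in NumPy, not the paper's implementation; the shapes, function names, and random data are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary dimension.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hard_token_decode(logits, codebook):
    # Conventional decoding: pick the single most likely code per token.
    return codebook[logits.argmax(axis=-1)]

def soft_token_decode(logits, codebook):
    # Soft token: probability-weighted mixture of all codebook entries,
    # so decoding reflects the model's uncertainty over codes.
    probs = softmax(logits)        # (num_tokens, vocab_size)
    return probs @ codebook        # (num_tokens, embed_dim)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(128, 16))  # 128 codes, 16-dim embeddings (toy sizes)
logits = rng.normal(size=(4, 128))     # 4 predicted tokens over the vocabulary

soft = soft_token_decode(logits, codebook)
hard = hard_token_decode(logits, codebook)
```

When the predicted distribution is nearly one-hot, the soft mixture coincides with the hard lookup; when it is uncertain, the mixture interpolates between plausible codes instead of committing to one.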

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Ning_2023_ICCV,
    author    = {Ning, Jia and Li, Chen and Zhang, Zheng and Wang, Chunyu and Geng, Zigang and Dai, Qi and He, Kun and Hu, Han},
    title     = {All in Tokens: Unifying Output Space of Visual Tasks via Soft Token},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {19900-19910}
}