Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23964-23974

Abstract


Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT). Lacking the guidance of semantic information, these backbones fail to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2.
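
The two operations described in the abstract (scoring visual tokens against semantic information, then fusing the low-correspondence tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_and_fuse_tokens`, the `keep_ratio` parameter, the scaled dot-product scoring, and the weighted-average fusion are all assumptions chosen for illustration, and the abstract does not specify how ZSLViT computes or applies these steps.

```python
import numpy as np


def select_and_fuse_tokens(visual_tokens, semantic_vec, keep_ratio=0.5):
    """Hypothetical sketch of semantic-guided token selection and fusion.

    visual_tokens: array of shape (n, d), one row per visual token.
    semantic_vec:  array of shape (d,), a semantic (e.g., attribute) embedding
                   assumed to live in the same space as the tokens.
    keep_ratio:    assumed fraction of tokens kept as "semantic-related".
    """
    n, d = visual_tokens.shape

    # Score each token against the semantic vector (scaled dot product),
    # then normalize with a numerically stable softmax.
    scores = visual_tokens @ semantic_vec / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()

    # Keep the highest-scoring tokens, preserving their original order.
    n_keep = max(1, int(np.ceil(n * keep_ratio)))
    order = np.argsort(-weights, kind="stable")
    keep_idx = np.sort(order[:n_keep])
    drop_idx = order[n_keep:]
    kept = visual_tokens[keep_idx]

    if drop_idx.size == 0:
        return kept, keep_idx

    # Fuse the low-correspondence tokens into one token via a weighted
    # average, so their information is compressed rather than deleted.
    w = weights[drop_idx]
    fused = (w[:, None] * visual_tokens[drop_idx]).sum(axis=0) / w.sum()
    return np.vstack([kept, fused[None, :]]), keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 16))   # 8 toy visual tokens of width 16
    semantic = rng.normal(size=16)      # toy semantic embedding
    out, kept = select_and_fuse_tokens(tokens, semantic, keep_ratio=0.5)
    print(out.shape, kept)
```

Applying such a step inside successive encoder blocks would shrink the token set layer by layer, which is one plausible reading of "progressively" in the abstract; the actual placement and design in ZSLViT should be taken from the paper itself.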

Related Material


[bibtex]
@InProceedings{Chen_2024_CVPR,
    author    = {Chen, Shiming and Hou, Wenjin and Khan, Salman and Khan, Fahad Shahbaz},
    title     = {Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {23964-23974}
}