Discrete Cosin TransFormer: Image Modeling From Frequency Domain

Xinyu Li, Yanyi Zhang, Jianbo Yuan, Hanlin Lu, Yibo Zhu; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 5468-5478

Abstract


In this paper, we propose the Discrete Cosin TransFormer (DCFormer), which learns semantics directly from a DCT-based frequency domain representation. We first show that transformer-based networks can learn semantics directly from a frequency domain representation based on the discrete cosine transform (DCT) without compromising performance. To achieve the desired efficiency-effectiveness trade-off, we then compress the input information in its frequency domain representation, which, inspired by JPEG compression, highlights the visually significant signals. We explore different frequency domain down-sampling strategies and show that it is possible to preserve the semantically meaningful information by strategically dropping the high-frequency components. The proposed DCFormer is tested on various downstream tasks, including image classification, object detection, and instance segmentation. It achieves performance comparable to the state of the art with fewer FLOPs, and outperforms commonly used backbones (e.g., SWIN) at similar FLOPs. Our ablation results also show that the proposed method generalizes well to different transformer backbones.
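The idea of dropping high-frequency DCT components can be illustrated with a small sketch. This is not the DCFormer implementation: the abstract does not specify the block size or the down-sampling strategy, so the 8x8 block (borrowed from JPEG), the square low-frequency corner, and all function names below (`dct_matrix`, `dct2`, `idct2`, `keep_low_frequencies`) are assumptions made for illustration only.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def dct2(block):
    """2D DCT of a square block: D @ X @ D^T."""
    d = dct_matrix(len(block))
    return matmul(matmul(d, block), transpose(d))


def idct2(coeffs):
    """Inverse 2D DCT: D^T @ C @ D (valid because D is orthonormal)."""
    d = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(d), coeffs), d)


def keep_low_frequencies(coeffs, keep):
    """Zero every coefficient outside the top-left keep x keep corner.

    Assumption: a square low-frequency corner is only one possible
    down-sampling strategy; the paper explores several.
    """
    return [[c if (u < keep and v < keep) else 0.0
             for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]


if __name__ == "__main__":
    # Hypothetical smooth 8x8 block (a diagonal intensity ramp).
    block = [[float(i + j) for j in range(8)] for i in range(8)]
    coeffs = dct2(block)
    compressed = keep_low_frequencies(coeffs, keep=4)  # 16 of 64 coefficients
    recon = idct2(compressed)
    max_err = max(abs(block[i][j] - recon[i][j])
                  for i in range(8) for j in range(8))
    print("max reconstruction error with 25% of coefficients:", max_err)
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, so smooth image regions lose little when only the low-frequency corner is kept.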

Related Material


[pdf]
[bibtex]
@InProceedings{Li_2023_WACV,
    author    = {Li, Xinyu and Zhang, Yanyi and Yuan, Jianbo and Lu, Hanlin and Zhu, Yibo},
    title     = {Discrete Cosin TransFormer: Image Modeling From Frequency Domain},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {5468-5478}
}