Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Gu, Yuchao; Wang, Xintao; Ge, Yixiao; Shan, Ying; Shou, Mike Zheng

Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, Mike Zheng Shou; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7631-7640

Abstract

Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper, we find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. Instead, learning to compress semantic features within VQ tokenizers significantly improves generative transformers' ability to capture textures and structures. We thus highlight two competing objectives of VQ tokenizers for image synthesis: semantic compression and details preservation. Different from previous work that prioritizes better details preservation, we propose Semantic-Quantized GAN (SeQ-GAN) with two learning phases to balance the two objectives. In the first phase, we propose a semantic-enhanced perceptual loss for better semantic compression. In the second phase, we fix the encoder and codebook but finetune the decoder to achieve better details preservation. Our proposed SeQ-GAN significantly improves VQ-based generative models for both unconditional and conditional image generation. Specifically, SeQ-GAN achieves a Fréchet Inception Distance (FID) of 6.25 and Inception Score (IS) of 140.9 on 256x256 ImageNet generation, a remarkable improvement over ViT-VQGAN, which obtains 11.2 FID and 97.2 IS.
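The tokenization step and the two-phase schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quantize` and `trainable_components` are hypothetical, the nearest-neighbor lookup under squared L2 distance is the standard VQ assignment rule rather than anything stated in the abstract, and the assumption that all three components are trained in the first phase is ours. Losses and network architectures are omitted entirely.

```python
from typing import Dict, List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Uses squared L2 distance, the usual VQ assignment rule (an assumption here).
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def trainable_components(phase: int) -> Dict[str, bool]:
    """Return which tokenizer parts are updated in each learning phase.

    Phase 1: all parts trained (our assumption; the abstract only names the loss).
    Phase 2: encoder and codebook fixed, decoder finetuned (as in the abstract).
    """
    if phase == 1:
        return {"encoder": True, "codebook": True, "decoder": True}
    if phase == 2:
        return {"encoder": False, "codebook": False, "decoder": True}
    raise ValueError("phase must be 1 or 2")


# Toy example: three 2-D encoder outputs and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8]]
tokens = quantize(features, codebook)
print(tokens)                   # token indices fed to the generative transformer
print(trainable_components(2))  # only the decoder is updated in phase 2
```

Because the encoder and codebook are fixed in the second phase, the token indices produced for a given image do not change there; only the mapping from tokens back to pixels is refined.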

Related Material

@InProceedings{Gu_2024_CVPR,
    author    = {Gu, Yuchao and Wang, Xintao and Ge, Yixiao and Shan, Ying and Shou, Mike Zheng},
    title     = {Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {7631-7640}
}