MeaCap: Memory-Augmented Zero-shot Image Captioning
Abstract
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into two main types: training-free and text-only-training methods. While both types integrate pre-trained vision-language models such as CLIP for image-text similarity evaluation and a pre-trained language model (LM) for caption generation, they differ in whether a textual corpus is used to train the LM. Despite achieving promising performance on certain metrics, existing methods commonly suffer from drawbacks: training-free methods often generate hallucinations, whereas text-only-training methods may lack generalization capability. To address these challenges, we propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). Equipped with a textual memory, the framework incorporates a retrieve-then-filter module to extract key concepts highly relevant to the image. By leveraging our proposed memory-augmented visual-related fusion score within a keywords-to-sentence LM, MeaCap generates concept-centered captions that exhibit high consistency with the image, with reduced hallucinations and enriched world knowledge. MeaCap achieves state-of-the-art performance across various zero-shot IC settings. Our code is publicly available at https://github.com/joeyz0z/MeaCap.
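As a rough illustration of the retrieval step behind the retrieve-then-filter module described above, the following Python sketch scores a textual memory against an image with off-the-shelf CLIP and keeps the top-k most similar sentences as sources of key concepts. This is not the authors' implementation (see the linked repository for that); the in-line memory contents, checkpoint name, input path, and k are illustrative assumptions.

# Minimal sketch of retrieval from a textual memory with CLIP.
# Assumptions (not from the paper): the tiny in-line memory, the
# checkpoint name, the path "example.jpg", and k=2 are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative textual memory; the paper uses a large caption corpus.
memory = [
    "a dog runs across a grassy field",
    "a man rides a surfboard on a wave",
    "a plate of pasta with tomato sauce",
]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=memory, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each memory entry.
scores = outputs.logits_per_image.squeeze(0)
top_idx = scores.topk(k=2).indices
retrieved = [memory[i] for i in top_idx]
print(retrieved)  # candidate sentences to filter down to key concepts

The retrieved sentences would then be filtered for concepts and passed to the keywords-to-sentence LM, which the sketch does not cover.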
Related Material
[pdf] [supp] [arXiv]

[bibtex]
@InProceedings{Zeng_2024_CVPR,
    author    = {Zeng, Zequn and Xie, Yan and Zhang, Hao and Chen, Chiyu and Chen, Bo and Wang, Zhengjue},
    title     = {MeaCap: Memory-Augmented Zero-shot Image Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14100-14110}
}