Questioning, Answering, and Captioning for Zero-Shot Detailed Image Caption

Duc-Tuan Luu, Viet-Tuan Le, Duc Minh Vo; Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, 2024, pp. 242-259

Abstract


End-to-end pre-trained large vision-language models (VLMs) have made unprecedented progress in image captioning. Nonetheless, they struggle to generate detailed captions, which require models to capture spatial relations, counting, text rendering, world knowledge, and other aspects that may or may not be present in the image. To overcome these shortcomings, we present a Question-Answer-Caption methodology, named QAC, that asks and answers questions about many aspects of a given image and then produces captions based on the responses. Specifically, we use ChatGPT to produce a set of questions about the image's content. The questions are then answered by a pre-trained VLM. After gathering all the answers, we prompt the pre-trained VLM to generate descriptive captions in a zero-shot setting. Our approach is plug-and-play and can be easily applied to any pre-trained VLM. We implement QAC on InstructBLIP and LLaVA, demonstrating performance comparable to fine-tuned models on the challenging DOCCI dataset.
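The three-stage pipeline described above (question, answer, caption) can be sketched in code. The following is a minimal illustration, not the authors' implementation: the prompts, the question budget, and the choice of gpt-3.5-turbo and the InstructBLIP vicuna-7b checkpoint are assumptions made for demonstration.

import torch
from PIL import Image
from openai import OpenAI
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

def generate_questions(n=5):
    # Stage 1: ChatGPT proposes questions probing spatial relations,
    # counts, rendered text, and world knowledge (prompt is an assumption).
    prompt = (
        f"Write {n} short questions, one per line, probing an image's "
        "spatial relations, object counts, rendered text, and world knowledge."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def ask_vlm(image, prompt):
    # Stages 2 and 3 both query the frozen, pre-trained VLM with a text prompt.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

def qac_caption(image_path):
    image = Image.open(image_path).convert("RGB")
    # Stage 2: answer every generated question with the pre-trained VLM.
    qa = [(q, ask_vlm(image, q)) for q in generate_questions()]
    context = "\n".join(f"Q: {q} A: {a}" for q, a in qa)
    # Stage 3: prompt the same VLM, zero-shot, for a detailed caption
    # conditioned on the gathered question-answer pairs.
    return ask_vlm(
        image,
        f"Given these facts about the image:\n{context}\n"
        "Write a single detailed caption describing the image.",
    )

print(qac_caption("example.jpg"))

Because the VLM is only prompted, never fine-tuned, the same wrapper applies unchanged to LLaVA or any other instruction-following VLM, which is what makes the approach plug-and-play.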

Related Material


[pdf]
[bibtex]
@InProceedings{Luu_2024_ACCV,
    author    = {Luu, Duc-Tuan and Le, Viet-Tuan and Vo, Duc Minh},
    title     = {Questioning, Answering, and Captioning for Zero-Shot Detailed Image Caption},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
    month     = {December},
    year      = {2024},
    pages     = {242-259}
}