M-RAT: a Multi-grained Retrieval Augmentation Transformer for Image Captioning

Jiayan Song, Renjie Pan, Jun Zhou, Hua Yang; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 3865-3882

Abstract


Current encoder-decoder methods for image captioning typically rely on a two-stage object detection module, or on large models trained with large-scale datasets to improve effectiveness, which increases computation cost and cannot incorporate new external knowledge. In this paper, we propose the Multi-grained Retrieval Augmentation Transformer (M-RAT), a novel end-to-end method that fuses retrieved text drawn from a changeable datastore with the input visual features through a Multi-modal Aligned Encoder, and introduces a specialized attention mechanism, Multi-MSA, that exploits both local and global interactions to capture fine-grained details. Additionally, we enhance the decoder's generation ability by employing fused low-level and high-level embeddings. Experiments demonstrate that M-RAT achieves performance comparable to state-of-the-art baselines with remarkable accuracy and detail, and shows excellent domain adaptability for novel objects.
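To make the core idea concrete, the sketch below illustrates one plausible reading of the retrieval-augmented fusion step: visual features and embeddings of retrieved captions are tagged with modality embeddings and jointly encoded so that cross-modal attention can align them. This is a minimal PyTorch sketch under our own assumptions; all module names, dimensions, and the interface for the retrieved text are illustrative and are not the authors' released implementation.

```python
# Minimal sketch of fusing retrieved-caption embeddings with visual
# features via a shared Transformer encoder. Hypothetical design: the
# paper's Multi-modal Aligned Encoder and Multi-MSA are not public, so
# shapes, layer counts, and the fusion scheme here are assumptions.
import torch
import torch.nn as nn


class MultiModalAlignedEncoderSketch(nn.Module):
    """Jointly encodes visual features and retrieved-text embeddings."""

    def __init__(self, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Learned embeddings that mark which modality each token came from.
        self.modality_embed = nn.Embedding(2, d_model)

    def forward(self, visual_feats, retrieved_text_feats):
        # visual_feats:         (B, N_v, d_model) grid/region features
        # retrieved_text_feats: (B, N_t, d_model) encoded retrieved captions
        v = visual_feats + self.modality_embed.weight[0]
        t = retrieved_text_feats + self.modality_embed.weight[1]
        fused = torch.cat([v, t], dim=1)  # one joint token sequence
        return self.encoder(fused)        # self-attention mixes modalities


if __name__ == "__main__":
    enc = MultiModalAlignedEncoderSketch()
    vis = torch.randn(2, 49, 512)  # e.g. 7x7 grid of visual features
    txt = torch.randn(2, 20, 512)  # e.g. tokens from top-k retrieved captions
    print(enc(vis, txt).shape)     # torch.Size([2, 69, 512])
```

Because the retrieved captions come from a separate datastore, swapping or extending that datastore changes the text stream without retraining the vision pathway, which is consistent with the domain adaptability the abstract claims.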

Related Material


[pdf]
[bibtex]
@InProceedings{Song_2024_ACCV,
    author    = {Song, Jiayan and Pan, Renjie and Zhou, Jun and Yang, Hua},
    title     = {M-RAT: a Multi-grained Retrieval Augmentation Transformer for Image Captioning},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {3865-3882}
}