Cap2Aug: Caption Guided Image Data Augmentation

Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, Rama Chellappa; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 9107-9117


Visual recognition in a low-data regime is challenging and often prone to overfitting. To mitigate this issue, several data augmentation strategies have been proposed. However, standard transformations (e.g., rotation, cropping, and flipping) provide limited semantic variation. To this end, we propose Cap2Aug, an image-to-image diffusion model-based data augmentation strategy that uses image captions to condition the image synthesis step. We generate a caption for an image and use this caption as an additional input to an image-to-image diffusion model. Compared with the usual data augmentation techniques, this caption conditioning increases the semantic diversity of the augmented images. We show that Cap2Aug is particularly effective when only a few samples are available for an object class. However, naively generating synthetic images is not adequate because of the domain gap between real and synthetic images. We therefore employ a maximum mean discrepancy loss to align the synthetic images with the real images and reduce the domain gap. We evaluate our method on two tasks: few-shot classification and image classification with a long-tailed class distribution. Cap2Aug achieves state-of-the-art performance on both tasks when evaluated on eleven benchmarks. Code:
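
As an illustration of the alignment idea in the abstract, the following is a minimal sketch of a maximum mean discrepancy (MMD) estimate between two sets of feature vectors. It is not the authors' implementation: the RBF kernel, the `bandwidth` parameter, the biased estimator, and the function names `rbf_kernel` and `mmd2` are all assumptions made for the example, and the abstract does not say which kernel or feature space the paper uses.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b.

    The kernel choice and bandwidth are assumptions made for this sketch.
    """
    # Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dist = np.maximum(sq_dist, 0.0)
    return np.exp(-sq_dist / (2.0 * bandwidth**2))


def mmd2(real, synth, bandwidth=1.0):
    """Biased estimate of squared MMD between two feature sets.

    real:  array of shape (n, d), e.g. features of real images
    synth: array of shape (m, d), e.g. features of synthetic images
    """
    k_rr = rbf_kernel(real, real, bandwidth)    # real vs. real
    k_ss = rbf_kernel(synth, synth, bandwidth)  # synthetic vs. synthetic
    k_rs = rbf_kernel(real, synth, bandwidth)   # real vs. synthetic
    return float(k_rr.mean() + k_ss.mean() - 2.0 * k_rs.mean())
```

The estimate is close to zero when the two feature sets come from the same distribution and grows as they drift apart, which is why a term of this kind can serve as a penalty that pulls synthetic-image features toward real-image features during training.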

Related Material

@InProceedings{Roy_2025_WACV,
    author    = {Roy, Aniket and Shah, Anshul and Shah, Ketul and Roy, Anirban and Chellappa, Rama},
    title     = {Cap2Aug: Caption Guided Image Data Augmentation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {9107-9117}
}