TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation

Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2526-2537

Abstract


We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces, regardless of class distribution. TextManiA augments visual data with intra-class semantic perturbations by exploiting easy-to-understand, visually mimetic words, i.e., attributes. This work builds on the hypothesis that general language models, e.g., BERT and GPT, encompass visual information to some extent, even without being trained on visual data. Given this hypothesis, TextManiA transfers pre-trained text representations obtained from a well-established large language encoder to the target visual feature space being learned. Our extensive analysis suggests that the language encoder indeed encodes visual information that is at least useful for augmenting visual representations. Our experiments demonstrate that TextManiA is particularly powerful on scarce samples, whether class-imbalanced or evenly distributed. We also show compatibility with label mix-based approaches on evenly distributed scarce data.
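
To make the idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how attribute words could drive intra-class perturbations: the difference between an attribute-prefixed prompt and the plain class prompt is projected into the visual feature space and added to a visual feature. The prompt templates, the projection layer `proj`, the scale `alpha`, and the stub `encode_text` are all illustrative assumptions; in the paper a pre-trained language encoder (e.g., BERT or GPT) would stand in for the stub.

```python
import torch
import torch.nn as nn

# Minimal sketch of text-driven feature perturbation (illustrative only).
# `encode_text` stands in for any frozen language encoder; here it is a
# cached random stub so the example runs on its own.
TEXT_DIM, VIS_DIM = 512, 256
_prompt_cache = {}

def encode_text(prompt: str) -> torch.Tensor:
    """Stub text encoder: returns a cached random embedding per prompt."""
    if prompt not in _prompt_cache:
        _prompt_cache[prompt] = torch.randn(TEXT_DIM)
    return _prompt_cache[prompt]

# Learnable projection from the text space to the visual feature space (assumed).
proj = nn.Linear(TEXT_DIM, VIS_DIM)

def attribute_perturb(vis_feat: torch.Tensor, cls: str, attrs, alpha: float = 0.5):
    """Add attribute-direction perturbations (text difference vectors) to a visual feature."""
    base = encode_text(f"a photo of a {cls}")
    out = []
    for a in attrs:
        # e.g., "red car" - "car" gives an intra-class attribute direction.
        delta = encode_text(f"a photo of a {a} {cls}") - base
        out.append(vis_feat + alpha * proj(delta))
    return torch.stack(out)

# Usage: enrich one scarce-class feature with attribute variants.
feat = torch.randn(VIS_DIM)
aug = attribute_perturb(feat, "car", ["red", "blue", "tiny", "huge"])
print(aug.shape)  # torch.Size([4, 256])
```

The design choice illustrated here is that the perturbation directions come purely from text, so semantic diversity can be injected even for classes with very few images.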

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Ye-Bin_2023_ICCV,
    author    = {Ye-Bin, Moon and Kim, Jisoo and Kim, Hongyeob and Son, Kilho and Oh, Tae-Hyun},
    title     = {TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2526-2537}
}