In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 21981-21992


Large-scale noisy web image-text datasets have proven effective for learning robust vision-language models. However, to transfer these models to the task of video retrieval, they still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, which uses only text queries together with uncurated web videos during training, without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. For this purpose, we introduce a multi-style contrastive training procedure that improves generalizability over several datasets simultaneously. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval, and we improve state-of-the-art performance on zero-shot text-video retrieval.
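
To make the idea of contrastive training over several text styles concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the loss, so this assumes a standard symmetric InfoNCE objective on text and video embeddings, and assumes the per-style losses are simply averaged. The function names, the temperature value, and the style keys are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def nce_rows(sim):
    # Mean over rows of -log softmax(row)[i], where the i-th column
    # is treated as the positive for the i-th row.
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)

def symmetric_info_nce(text_emb, video_emb, temperature=0.05):
    # Assumed objective: text-to-video and video-to-text InfoNCE, averaged.
    t = [normalize(x) for x in text_emb]
    v = [normalize(x) for x in video_emb]
    sim = [[dot(a, b) / temperature for b in v] for a in t]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (nce_rows(sim) + nce_rows(sim_t))

def multi_style_loss(batches, temperature=0.05):
    # `batches` maps a (hypothetical) style name to a pair
    # (text_embeddings, video_embeddings); averaging is an assumption.
    losses = [symmetric_info_nce(t, v, temperature) for t, v in batches.values()]
    return sum(losses) / len(losses)
```

In this sketch, each style contributes one batch of matched text and video embeddings, and a single model would be updated on the averaged loss so that no one caption style dominates.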

Related Material

@InProceedings{Shvetsova_2023_ICCV,
    author    = {Shvetsova, Nina and Kukleva, Anna and Schiele, Bernt and Kuehne, Hilde},
    title     = {In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21981-21992}
}