Beyond Coarse-Grained Matching in Video-Text Retrieval

Aozhu Chen, Hazel Doughty, Xirong Li, Cees G. M. Snoek; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 71-87

Abstract


Text-to-video retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new metric for fine-grained evaluation that can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel findings and insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model's ability to perceive subtle single-word differences; 2) our fine-grained evaluation highlights the difficulty models face in distinguishing such subtle variations. To enhance fine-grained understanding, we propose a new baseline that can be easily combined with current methods. Experiments on this fine-grained evaluation demonstrate that our approach clearly enhances a model's ability to understand fine-grained differences.
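The abstract's hard-negative construction can be illustrated with a minimal sketch: for each part of speech, swap exactly one word of a caption for a subtle alternative. The substitution table and all example words below are assumptions for illustration only; the paper's actual generation pipeline is not reproduced here.

```python
# Hedged sketch of single-word hard-negative caption generation.
# All entries in this table are illustrative assumptions, not the
# paper's actual substitution lists.
SUBSTITUTIONS = {
    "noun":        {"dog": "cat", "guitar": "violin"},
    "verb":        {"opens": "closes", "pushes": "pulls"},
    "adjective":   {"red": "blue", "large": "small"},
    "adverb":      {"quickly": "slowly"},
    "preposition": {"into": "out of", "under": "over"},
}

def make_hard_negatives(caption: str):
    """Yield (pos, negative_caption) pairs that differ from the
    input caption in exactly one word."""
    words = caption.split()
    for pos, table in SUBSTITUTIONS.items():
        for i, word in enumerate(words):
            if word in table:
                variant = words.copy()
                variant[i] = table[word]  # single-word swap
                yield pos, " ".join(variant)

negatives = list(make_hard_negatives("a dog quickly runs into the red house"))
```

Each generated negative keeps the caption's structure intact, so a model must attend to the one changed word to rank the original caption above it.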

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Chen_2024_ACCV,
    author    = {Chen, Aozhu and Doughty, Hazel and Li, Xirong and Snoek, Cees G. M.},
    title     = {Beyond Coarse-Grained Matching in Video-Text Retrieval},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {71-87}
}