Text Query to Web Image to Video: A Comprehensive Ad-hoc Video Search

Nhat-Minh Nguyen, Tien-Dung Mai, Duy-Dinh Le; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 4141-4155

Abstract


In this study, we propose a novel approach to Ad-hoc Video Search that leverages the power of image search engines to synthesize query images for a corresponding textual query. Existing methods primarily rely on pre-trained language-image models to extract features from textual queries and from the keyframes of video segments. While recent approaches that use generative models to produce visual representations from text descriptions show promise, they are limited in diversity, authenticity, speed, and hardware requirements. In contrast, our proposed method leverages the vast and diverse image database available on the Internet through image search engines to directly synthesize query images from the input text description. Moreover, to improve computational efficiency, each video segment is represented by only a single keyframe. Specifically, we use only two general-purpose multimodal models to extract feature embeddings for textual queries, query images, and keyframes. To return a ranked list of relevant video segments for each query, we compute the weighted average similarity between each keyframe and both the textual query and the query images. Experiments conducted on the TRECVID V3C2 dataset with the main sets of textual queries from 2022 and 2023 demonstrate the efficiency of our method.
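The scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight `alpha`, the use of cosine similarity, and averaging over the query images are assumptions; the embedding dimensions and model choices are left abstract.

```python
import numpy as np

def normalize(x):
    # L2-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank_segments(keyframe_embs, text_emb, image_embs, alpha=0.5):
    """Rank video segments (one keyframe each) against a textual query
    and a set of synthesized query images.

    keyframe_embs: (n_segments, d) keyframe embeddings
    text_emb:      (d,) embedding of the textual query
    image_embs:    (n_images, d) embeddings of web-searched query images
    alpha:         assumed weight balancing text vs. image similarity
    """
    k = normalize(keyframe_embs)
    t = normalize(text_emb)
    imgs = normalize(image_embs)
    sim_text = k @ t                       # (n_segments,) text similarity
    sim_imgs = (k @ imgs.T).mean(axis=1)   # averaged image similarity
    score = alpha * sim_text + (1 - alpha) * sim_imgs
    order = np.argsort(-score)             # best-matching segments first
    return order, score
```

In practice the text and image embeddings would come from the two multimodal models mentioned in the abstract, and `alpha` would be tuned on held-out queries.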

Related Material


@InProceedings{Nguyen_2024_ACCV,
    author    = {Nguyen, Nhat-Minh and Mai, Tien-Dung and Le, Duy-Dinh},
    title     = {Text Query to Web Image to Video: A Comprehensive Ad-hoc Video Search},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {4141-4155}
}