[pdf]
[supp]
[bibtex]
@InProceedings{Ee_2025_WACV,
    author    = {Ee, Yeo Keat and Zhang, Hao and Matyasko, Alexander and Fernando, Basura},
    title     = {Deduce and Select Evidences with Language Models for Training-Free Video Goal Inference},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5937-5947}
}
Deduce and Select Evidences with Language Models for Training-Free Video Goal Inference
Abstract
We introduce ViDSE, a Video framework that Deduces and Selects visual Evidence for training-free video goal inference using language models. Unlike approaches that directly apply vision-language models (VLMs) or combine a VLM with an LLM to process dense video visuals, ViDSE explicitly selects relevant visual evidence (e.g., frames) based on the hypotheses deduced by the LLM. This selection not only improves accuracy but also reveals the logical process behind the model's decisions, enhancing explainability. Our experiments demonstrate that the selection step significantly reduces ambiguity in the subsequent inference stage and that ViDSE outperforms VLM-only and VLM+LLM models on goal inference benchmarks such as CrossTask and COIN. We further validate ViDSE's generalizability and robustness on action recognition benchmarks such as ActivityNet and UCF-101 under training-free and open-vocabulary conditions. We observe that ViDSE readily generalizes to other video tasks (e.g., action recognition) that require filtering of redundant and irrelevant information.
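To make the deduce-then-select idea concrete, the sketch below illustrates one way such a training-free pipeline could be wired together. It is not the authors' implementation: the callables `llm`, `vlm`, and `frame_text_similarity`, the prompts, and the top-k selection are hypothetical placeholders standing in for an LLM, a captioning VLM, and a frame-text scoring function (e.g., CLIP-style similarity), assumed here purely for illustration.

```python
def infer_goal(frames, candidate_goals, llm, vlm, frame_text_similarity, k=4):
    """Illustrative deduce -> select -> infer pipeline (assumptions, not ViDSE's code)."""
    # 1) Deduce: ask the LLM what visual cues would support each candidate goal.
    evidence_hypotheses = {
        goal: llm(f"List visual cues one would expect to see in a video "
                  f"whose goal is: {goal}")
        for goal in candidate_goals
    }

    # 2) Select: keep only the frames most similar to the deduced cues,
    #    discarding redundant or irrelevant frames.
    scored = [
        (max(frame_text_similarity(frame, cues)
             for cues in evidence_hypotheses.values()), frame)
        for frame in frames
    ]
    selected = [frame for _, frame in
                sorted(scored, key=lambda x: x[0], reverse=True)[:k]]

    # 3) Infer: describe the selected evidence and let the LLM pick the goal.
    captions = [vlm(frame, "Describe what is happening in this frame.")
                for frame in selected]
    return llm("Evidence:\n" + "\n".join(captions) +
               "\nWhich goal best matches this evidence? Options: " +
               ", ".join(candidate_goals))
```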