Enhancing Vision-Language Models for Zero-Shot Video Action Recognition via Visual-Textual Refinement and Improved Interpretability
Abstract
Extending pre-trained image-based vision-language models (VLMs), such as CLIP, to video action recognition is an important yet challenging problem. Recent zero-shot approaches have enriched the textual class representations using large language models (LLMs) to generate more descriptive labels, but these methods were designed for image-based tasks and leave the visual features untouched, failing to leverage the temporal and semantic richness of videos. We propose a unified, zero-shot framework that extends pre-trained image-based VLMs to video action recognition by jointly enhancing both visual and textual class representations, while providing inherent interpretability. On the visual side, we use a video-to-text model to generate natural language summaries of the query video, capturing fine-grained spatio-temporal cues that complement the visual embeddings. On the textual side, we prompt an LLM to produce action-specific descriptors, including language attributes, descriptions, and hierarchical action contexts, that enrich the class label representations and improve semantic alignment. We evaluate our framework in the zero-shot setting on four standard video action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and Something-Something-V2) using four diverse VLM backbones. Our method is fully compatible with pre-trained image-based VLMs, enabling them to be effectively extended to video action recognition without video training, and it also improves performance when applied to VLMs already adapted to video. By design, our framework also provides transparent, human-understandable justifications for each prediction.
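As a rough illustration of the dual enhancement the abstract describes, the sketch below scores a video against class embeddings built from LLM-style descriptors and complements the pooled frame embedding with the embedding of a video-to-text summary. The CLIP checkpoint, the hard-coded descriptor and summary strings, and the fusion weight alpha are all illustrative assumptions, not the paper's actual configuration.

# Minimal sketch of the zero-shot scoring idea, under the assumptions above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Textual side: class labels enriched with descriptor sentences
# (hard-coded here in place of actual LLM output).
class_descriptors = {
    "archery": [
        "a person drawing a bow and releasing an arrow",
        "an outdoor target sport using a bow",
    ],
    "juggling": [
        "a person tossing and catching several objects in the air",
        "a rhythmic circus skill performed with balls or clubs",
    ],
}

def encode_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Each class embedding is the mean of its descriptor embeddings.
class_embs = torch.stack(
    [encode_texts(d).mean(dim=0) for d in class_descriptors.values()]
)
class_embs = class_embs / class_embs.norm(dim=-1, keepdim=True)

# Visual side: mean-pool frame embeddings (blank dummy frames stand in
# for frames sampled from the query video).
frames = [Image.new("RGB", (224, 224)) for _ in range(8)]
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    frame_feats = model.get_image_features(**inputs)
video_emb = frame_feats.mean(dim=0)

# Complement the visual embedding with a video-to-text summary
# (a fixed string here; the paper generates it with a video-to-text model).
summary_emb = encode_texts(["a person aims a bow at a distant target"])[0]

alpha = 0.5  # assumed fusion weight between visual and summary cues
query = alpha * video_emb / video_emb.norm() + (1 - alpha) * summary_emb
query = query / query.norm()

scores = query @ class_embs.T
pred = list(class_descriptors)[scores.argmax().item()]
print(dict(zip(class_descriptors, scores.tolist())), "->", pred)

Averaging descriptor embeddings per class and linearly fusing the two query embeddings are simple stand-ins; the summary text additionally serves as a human-readable justification for the prediction, in the spirit of the interpretability claim.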
Related Material
[pdf]
[bibtex]
@InProceedings{Yousaf_2025_ICCV,
  author    = {Yousaf, Adeel and Shah, Mubarak},
  title     = {Enhancing Vision-Language Models for Zero-Shot Video Action Recognition via Visual-Textual Refinement and Improved Interpretability},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {331-340}
}