AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2265-2281

Abstract


Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Video Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground-truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
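
The two-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `VLM` and `LLM` callable types, the function names, and the prompt wording are all hypothetical, and the visual indication of characters in the frames is assumed to have been applied beforehand.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
# any callable wrapping a real VLM or LLM could be plugged in here.
VLM = Callable[[Sequence[str], str], str]  # (frame paths, prompt) -> dense description
LLM = Callable[[str], str]                 # prompt -> generated text


def build_stage1_prompt(character_names: Sequence[str]) -> str:
    """Stage 1 prompt: request a comprehensive description that names characters."""
    names = ", ".join(character_names) if character_names else "none"
    return (
        "Characters visually marked in the frames: " + names + ". "
        "Describe in detail what happens in the clip, "
        "referring to the characters by name."
    )


def build_stage2_prompt(dense_description: str) -> str:
    """Stage 2 prompt: request a single succinct AD sentence."""
    return (
        "Summarise the following description into one succinct "
        "audio description sentence.\n" + dense_description
    )


def generate_ad(
    frames: Sequence[str],
    character_names: Sequence[str],
    vlm: VLM,
    llm: LLM,
) -> str:
    """Run the VLM for a dense description, then the LLM to condense it."""
    dense = vlm(frames, build_stage1_prompt(character_names))
    return llm(build_stage2_prompt(dense)).strip()
```

Passing the models in as plain callables keeps the sketch training-free in spirit: no weights are touched, and either stage can be swapped for a different off-the-shelf model without changing the pipeline.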

Related Material


[pdf]
[bibtex]
@InProceedings{Xie_2024_ACCV,
    author    = {Xie, Junyu and Han, Tengda and Bain, Max and Nagrani, Arsha and Varol, G\"ul and Xie, Weidi and Zisserman, Andrew},
    title     = {AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2265-2281}
}