LoCo-MAD: Long-Range Context-Enhanced Model Towards Plot-Centric Movie Audio Description
Abstract
Movie Audio Description (MAD) aims to enable the visually impaired community to enjoy movies by transforming them into coherent and accurate audio descriptions. Due to the extended duration and complex plots of movies, MAD remains at an early stage of research compared to other cross-modal text generation tasks. Current MAD methods fail to model long videos efficiently or to integrate long-range context when generating plot-coherent descriptions. To address these challenges, we propose a Long-Range Context-Enhanced Movie Audio Description model (LoCo-MAD), which is trained in two stages. The first stage adapts an image-text pretrained model into a Pre-aligned Movie Encoder (PME), which uses learnable queries to obtain compact visual representations and is supervised by three multimodal objectives. The second stage builds LoCo-MAD from the pretrained PME, a Dynamic Selection Module (DSM), and a large language model. We project the visual representations from the PME into soft visual prompts and use the DSM to select the most relevant long-range descriptions and subtitles as contextual prompts. A large language model then integrates these multimodal prompts and generates plot-related movie descriptions. The proposed method is extensively evaluated on the MAD-v2 and LSMDC datasets, achieving CIDEr scores of 23.7 and 20.0, respectively. Our code will be released at https://github.com/blindwang/LoCo-MAD.
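To make the two-stage design concrete, the sketch below illustrates the core mechanisms the abstract describes: learnable queries that compress frame features into compact visual tokens, and a selection module that keeps only the most relevant long-range context. This is a minimal PyTorch-style sketch assuming Q-Former-style cross-attention and a cosine-similarity selection rule; all class names, dimensions, and scoring choices are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch only: PreAlignedMovieEncoder and DynamicSelectionModule
# are hypothetical stand-ins for the PME and DSM described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreAlignedMovieEncoder(nn.Module):
    """Compress variable-length frame features into a fixed set of
    learnable query tokens via cross-attention (PME-style)."""
    def __init__(self, dim=512, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats):  # frame_feats: (B, T, dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, frame_feats, frame_feats)
        return out  # (B, num_queries, dim) compact visual tokens

class DynamicSelectionModule(nn.Module):
    """Score long-range context texts (prior descriptions, subtitles)
    against the clip's visual summary and keep the top-k (DSM-style)."""
    def __init__(self, top_k=4):
        super().__init__()
        self.top_k = top_k

    def forward(self, visual_tokens, context_embs):  # (B, Q, d), (B, N, d)
        summary = visual_tokens.mean(dim=1, keepdim=True)              # (B, 1, d)
        scores = F.cosine_similarity(summary, context_embs, dim=-1)    # (B, N)
        idx = scores.topk(self.top_k, dim=-1).indices                  # (B, k)
        gather = idx.unsqueeze(-1).expand(-1, -1, context_embs.size(-1))
        return context_embs.gather(1, gather)  # (B, k, d) selected context

# Toy forward pass: visual tokens become soft prompts and are concatenated
# with the selected context before feeding a large language model (omitted).
pme, dsm = PreAlignedMovieEncoder(), DynamicSelectionModule()
frames = torch.randn(2, 100, 512)    # 100 frames from a movie clip
contexts = torch.randn(2, 50, 512)   # 50 candidate context sentences
vis = pme(frames)
prompts = torch.cat([vis, dsm(vis, contexts)], dim=1)
print(prompts.shape)                 # torch.Size([2, 36, 512])
```

In this sketch the context is selected by similarity to a pooled visual summary; the actual relevance criterion used by the DSM may differ, but the pattern (compress, select, concatenate into prompts for an LLM) follows the pipeline the abstract lays out.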
Related Material
[pdf]
[supp]
[bibtex]
@InProceedings{Wang_2024_ACCV,
  author    = {Wang, Jiayi and Liu, Zihao and Wu, Xiaoyu},
  title     = {LoCo-MAD: Long-Range Context-Enhanced Model Towards Plot-Centric Movie Audio Description},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {1366-1383}
}