LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling

Kaijing Ma, Xianghao Zang, Zerun Feng, Han Fang, Chao Ban, Yuhan Wei, Zhongjiang He, Yongxiang Li, Hao Sun; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2798-2803

Abstract


Recent studies have explored the potential of large language models (LLMs) for understanding the semantic information in images. However, the use of LLMs to understand videos, which contain continuous contextual information, remains limited. In this paper, we propose LLaViLo (LLaMa-Video-Localizer), a video moment retrieval pipeline powered by a large language model. LLaViLo has two key features: 1) In contrast to fine-tuning the entire LLM, we introduce adapter modules and optimize only the 1.7% of additional parameters they contain, keeping the pre-trained LLM frozen to enable efficient alignment of video and text. 2) A multi-objective optimization framework concurrently optimizes two objectives: a set prediction objective and a captioning objective. The joint training of these two objectives allows the proposed framework to produce high-quality time coordinates. Compared with other state-of-the-art methods, the proposed LLaViLo achieves significant performance improvements on the QVHighlights and Charades-STA datasets.
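The abstract's central idea, freezing the pre-trained LLM and training only small adapter modules, can be illustrated with a minimal PyTorch sketch. The module sizes, the bottleneck adapter design, and the stand-in Transformer backbone below are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    (Illustrative design; the paper's actual adapter architecture may differ.)"""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Stand-in for one frozen pre-trained LLM block (hypothetical sizes).
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False  # freeze the pre-trained weights

adapter = Adapter()  # only these parameters receive gradients

# Forward pass: frozen backbone features pass through the trainable adapter.
x = torch.randn(1, 4, 768)        # (batch, tokens, dim)
out = adapter(backbone(x))

# The trainable fraction stays small relative to the frozen backbone;
# a joint loss (set prediction + captioning) would be backpropagated
# through `out` into the adapter only.
trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

Because only the adapter's parameters require gradients, an optimizer built over `adapter.parameters()` updates a small fraction of the total weights, which is the efficiency argument the abstract makes.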

Related Material


[bibtex]
@InProceedings{Ma_2023_ICCV,
    author    = {Ma, Kaijing and Zang, Xianghao and Feng, Zerun and Fang, Han and Ban, Chao and Wei, Yuhan and He, Zhongjiang and Li, Yongxiang and Sun, Hao},
    title     = {LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {2798-2803}
}