Attend and Replay: Efficient Action Understanding in Long Videos via Mechanistic Interpretability

Puyue Hou, Jinjin Zhang, Di Huang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 6602-6611

Abstract


Efficient action understanding in long videos remains a significant challenge for multimodal large language models (MLLMs), primarily because target actions are difficult to localize within long frame sequences, where unrelated actions introduce overwhelming interference. In this work, we approach efficient action localization from an internal interpretability perspective, leveraging the relationship between text and video tokens to remove irrelevant tokens and thereby enhance video understanding. By tracing attention distributions across videos of varying frame lengths, we observe that failures in action understanding correlate directly with unrelated actions receiving notable attention scores. Motivated by these findings, we propose an Attend and Replay method that efficiently locates critical action information and strengthens its semantic representation. The approach first reduces unrelated action tokens with an attention-guided spatiotemporal pruning strategy, then enriches target action tokens via a pivot-token aggregation method. Extensive experiments show that integrating our method with existing MLLMs (e.g., LLaVA-Video, Qwen2.5-VL, MiMo-VL) yields superior performance over other counterparts on various datasets while offering markedly faster inference.
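
The abstract outlines a two-stage pipeline: score video tokens by the attention they receive from text tokens, prune the low-scoring (likely unrelated) ones, and fold the pruned information back into the retained "pivot" tokens. No code accompanies this page, so the following is a minimal PyTorch sketch of that idea under simplifying assumptions: the function name prune_and_aggregate, the keep_ratio parameter, and the similarity-based assignment of dropped tokens to pivots are illustrative choices, not the authors' implementation, which operates inside the MLLM's attention layers.

import torch

def prune_and_aggregate(video_tokens, text_to_video_attn, keep_ratio=0.25):
    """Illustrative sketch, not the authors' code.
    video_tokens: (N, D) video token embeddings.
    text_to_video_attn: (T, N) attention weights from text tokens to video tokens.
    Returns (K, D) enriched pivot tokens, with K = keep_ratio * N (at least 1)."""
    n, d = video_tokens.shape
    k = max(1, int(keep_ratio * n))

    # Attend: score each video token by the text attention it receives,
    # and keep the top-k as pivot tokens (attention-guided pruning).
    scores = text_to_video_attn.mean(dim=0)              # (N,)
    keep_idx = scores.topk(k).indices
    drop_mask = torch.ones(n, dtype=torch.bool)
    drop_mask[keep_idx] = False

    pivots = video_tokens[keep_idx]                      # (K, D)
    dropped = video_tokens[drop_mask]                    # (N-K, D)
    if dropped.numel() == 0:
        return pivots

    # Replay: assign each dropped token to its most similar pivot and fold it
    # in, weighted by its attention score, so pivot semantics are enriched
    # rather than the pruned information being discarded outright.
    sim = torch.nn.functional.normalize(dropped, dim=-1) @ \
          torch.nn.functional.normalize(pivots, dim=-1).T    # (N-K, K)
    assign = sim.argmax(dim=-1)                               # (N-K,)
    w = scores[drop_mask].unsqueeze(-1)                       # (N-K, 1)
    enriched = pivots.clone()
    for j in range(k):
        members = assign == j
        if members.any():
            enriched[j] = enriched[j] + (w[members] * dropped[members]).mean(dim=0)
    return enriched

Because the downstream LLM then attends over only K << N video tokens, this kind of pruning is what would account for the reported inference speedup.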

Related Material


@InProceedings{Hou_2025_ICCV,
    author    = {Hou, Puyue and Zhang, Jinjin and Huang, Di},
    title     = {Attend and Replay: Efficient Action Understanding in Long Videos via Mechanistic Interpretability},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {6602-6611}
}