LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant

Li, Wei; Hu, Bing; Shao, Rui; Shen, Leyang; Nie, Liqiang

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 3240-3251

Abstract

First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "**F**ast & **S**low Video-Language Thinker" as on**LI**ne vide**O** assista**N**t, **LION-FS**, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy. **1) Fast Path: Routing-Based Response Determination** evaluates frame by frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher-frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing the number of tokens, while utilizing Token Dropping Routing to eliminate redundant features. **2) Slow Path: Multi-granularity Keyframe Augmentation** optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond the atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency. The code will be released soon.
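The fast/slow split described in the abstract can be illustrated with a small control-flow sketch. This is not the authors' implementation: the abstract does not specify how the routing modules work, so position-wise averaging stands in for Token Aggregation Routing and norm-based top-k selection stands in for Token Dropping Routing. All names here (`aggregate_tokens`, `drop_tokens`, `run_assistant`, `needs_response`, `generate`, `window`, `keep`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]   # one visual token as a plain feature vector
Frame = List[Token]   # one frame as a list of tokens


def aggregate_tokens(frames: Sequence[Frame]) -> Frame:
    """Fuse several frames by averaging tokens position-wise.

    Assumption: a simple mean stands in for learned aggregation. The output
    has as many tokens as a single frame, so fusing more frames does not
    lengthen the token sequence.
    """
    n_frames = len(frames)
    n_tokens = len(frames[0])
    dim = len(frames[0][0])
    return [
        [sum(f[t][d] for f in frames) / n_frames for d in range(dim)]
        for t in range(n_tokens)
    ]


def drop_tokens(frame: Frame, keep: int) -> Frame:
    """Keep the `keep` tokens with the largest squared L2 norm, in order.

    Assumption: the norm is a placeholder redundancy score, not the paper's
    routing criterion.
    """
    norms = [sum(x * x for x in tok) for tok in frame]
    ranked = sorted(range(len(frame)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [frame[i] for i in kept]


def run_assistant(
    stream: Sequence[Frame],
    needs_response: Callable[[Frame], bool],
    generate: Callable[[Frame], str],
    window: int = 2,
    keep: int = 2,
) -> List[Tuple[int, str]]:
    """Run the fast path on every frame and the slow path only on keyframes."""
    responses: List[Tuple[int, str]] = []
    history: List[Frame] = []
    for idx, frame in enumerate(stream):
        history.append(frame)
        # Fast path: a compact token set decides whether to respond now.
        fast_tokens = drop_tokens(aggregate_tokens(history[-window:]), keep)
        if needs_response(fast_tokens):
            # Slow path: the full keyframe is passed to response generation.
            responses.append((idx, generate(frame)))
    return responses


# Toy stream: two quiet frames followed by one with large activations.
stream = [
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]],
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]],
    [[2.0, 2.0], [0.1, 0.0], [3.0, 0.0]],
]
result = run_assistant(
    stream,
    needs_response=lambda toks: max(max(t) for t in toks) > 0.5,
    generate=lambda frame: f"response from {len(frame)} tokens",
)
print(result)  # [(2, 'response from 3 tokens')]
```

In this toy run the decision step sees only two tokens per frame, while the generation step receives all three tokens of the triggering frame. That mirrors the idea of spending little compute on every frame and more on the few frames that warrant a response.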

Related Material



@InProceedings{Li_2025_CVPR,
    author    = {Li, Wei and Hu, Bing and Shao, Rui and Shen, Leyang and Nie, Liqiang},
    title     = {LION-FS: Fast \& Slow Video-Language Thinker as Online Video Assistant},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3240-3251}
}