VideoOrion: Tokenizing Object Dynamics in Videos

Feng, Yicheng; Li, Yijiang; Zhang, Wanpeng; Zheng, Sipeng; Luo, Hao; Yue, Zihao; Lu, Zongqing

Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, Zongqing Lu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 20401-20412

Abstract

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Prior methods resort to downsampling the original video or aggregating visual tokens using resamplers, which leads to information loss and entangled semantics. In contrast, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens and achieves competitive results on both general video question answering and video-based referring benchmarks.
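To make the idea of "aggregating spatial-temporal object features" into object tokens more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not say how features are aggregated, so masked mean pooling is used here purely as an assumed stand-in. The function name `pool_object_tokens`, the per-patch feature layout, and the binary masks (imagined as the output of a detect-segment-track pipeline) are all hypothetical.

```python
from typing import List

Vector = List[float]


def pool_object_tokens(
    features: List[List[Vector]],   # features[t][p]: D-dim feature of patch p in frame t
    masks: List[List[List[int]]],   # masks[k][t][p]: 1 if patch p in frame t belongs to object k
) -> List[Vector]:
    """Return one token per object by averaging the features its mask covers.

    Assumption: masked mean pooling over space and time stands in for the
    aggregation step, which the abstract leaves unspecified.
    """
    dim = len(features[0][0])
    tokens: List[Vector] = []
    for obj_mask in masks:
        total = [0.0] * dim
        count = 0
        # Walk every frame and patch, accumulating features inside the mask.
        for frame_feats, frame_mask in zip(features, obj_mask):
            for feat, inside in zip(frame_feats, frame_mask):
                if inside:
                    for d in range(dim):
                        total[d] += feat[d]
                    count += 1
        # An object that is never visible yields an all-zero token.
        tokens.append([v / count for v in total] if count else total)
    return tokens


# Toy input: 2 frames, 3 patches per frame, 2-dim features (made-up values).
features = [
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],
    [[7.0, 6.0], [9.0, 8.0], [11.0, 10.0]],
]
masks = [
    [[1, 1, 0], [0, 1, 0]],  # object 0: patches 0 and 1 in frame 0, patch 1 in frame 1
    [[0, 0, 1], [0, 0, 1]],  # object 1: patch 2 in both frames
]
tokens = pool_object_tokens(features, masks)
print(tokens)
```

In this sketch the number of tokens equals the number of tracked objects rather than the number of frames or patches, which illustrates why object-level tokens can be compact.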

Related Material

[bibtex]

@InProceedings{Feng_2025_ICCV,
    author    = {Feng, Yicheng and Li, Yijiang and Zhang, Wanpeng and Zheng, Sipeng and Luo, Hao and Yue, Zihao and Lu, Zongqing},
    title     = {VideoOrion: Tokenizing Object Dynamics in Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {20401-20412}
}