TrajTok: Learning Trajectory Tokens Enhances Video Understanding

Zheng, Chenhao; Zhang, Jieyu; Zhang, Jianing; Huang, Weikai; Kumar, Ashutosh; Kong, Quan; Tuzel, Oncel; Li, Chun-Liang; Krishna, Ranjay

Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 31207-31218

Abstract

Tokenization in video models, typically done through patchification, produces an excessive number of redundant tokens, which severely limits the efficiency and scalability of video models. Recent trajectory-based tokenizers offer a promising solution by decoupling token count from video duration, but they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with the video model for a downstream objective. It dynamically adapts its token granularity to the semantic complexity of a video, independent of the video's duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok stays lightweight and efficient while empirically improving video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok is also a versatile component beyond its role as a tokenizer: it can be seamlessly integrated either as a probing head for pretrained visual features (TrajAdapter) or as an alignment connector in vision-language models (TrajVLM), with especially strong performance in long-video reasoning.
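
To make the contrast between patch tokens and trajectory tokens concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a hypothetical per-pixel trajectory labeling is already given (in TrajTok this comes from a learned segmenter), and it uses plain average pooling as a stand-in for however the model actually aggregates features. The names `patch_token_count`, `trajectory_tokens`, and `toy_clip` are invented for this example.

```python
from collections import defaultdict


def patch_token_count(frames, height, width, patch=16, tubelet=1):
    """Token count under plain patchification: grows linearly with frames."""
    return (frames // tubelet) * (height // patch) * (width // patch)


def trajectory_tokens(labels, features):
    """Average-pool per-pixel features into one token per trajectory id.

    labels:   nested list [T][H][W] of integer trajectory ids. This is a
              hypothetical stand-in for a segmenter's output.
    features: nested list [T][H][W] of equal-length lists of floats.
    Average pooling is an assumption made for illustration only.
    """
    sums = {}
    counts = defaultdict(int)
    for frame_labels, frame_feats in zip(labels, features):
        for row_labels, row_feats in zip(frame_labels, frame_feats):
            for tid, feat in zip(row_labels, row_feats):
                if tid not in sums:
                    sums[tid] = [0.0] * len(feat)
                for i, value in enumerate(feat):
                    sums[tid][i] += value
                counts[tid] += 1
    return {tid: [s / counts[tid] for s in vec] for tid, vec in sums.items()}


def toy_clip(frames):
    """Build a toy clip with two static 'objects' (ids 0 and 1) per frame."""
    labels = [[[0, 0, 1, 1], [0, 0, 1, 1]] for _ in range(frames)]
    features = [
        [[[float(tid), 1.0] for tid in row] for row in frame]
        for frame in labels
    ]
    return labels, features
```

In this toy setup, doubling the number of frames doubles the output of `patch_token_count`, while `trajectory_tokens` still returns one token per trajectory id, which is the sense in which token count tracks scene content and not duration.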

Related Material


@InProceedings{Zheng_2026_CVPR,
    author    = {Zheng, Chenhao and Zhang, Jieyu and Zhang, Jianing and Huang, Weikai and Kumar, Ashutosh and Kong, Quan and Tuzel, Oncel and Li, Chun-Liang and Krishna, Ranjay},
    title     = {TrajTok: Learning Trajectory Tokens Enhances Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {31207-31218}
}