Chrono: A Simple Blueprint for Representing Time in MLLMs

Boris Meinardus, Hector G. Rodriguez, Anil Batra, Anna Rohrbach, Marcus Rohrbach; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 4151-4156

Abstract


The recent success of Large Language Models (LLMs) has prompted their extension to the multimodal domain, first with image-text Multimodal LLMs (MLLMs) and then with video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, introduced novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to better encode contextual and temporal information. Interestingly, we find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to an image-text pretrained MLLM. Through extensive ablations across different MLLM architectures, finetuning and zero-shot settings, and different datasets, we achieve a new state of the art (SOTA) in moment retrieval on the most widely used benchmarks, Charades-STA, QVHighlights, and ActivityNet Captions, and in grounded video question answering on NeXT-GQA. Our code will be made publicly available.

Related Material


@InProceedings{Meinardus_2025_ICCV,
    author    = {Meinardus, Boris and Rodriguez, Hector G. and Batra, Anil and Rohrbach, Anna and Rohrbach, Marcus},
    title     = {Chrono: A Simple Blueprint for Representing Time in MLLMs},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4151-4156}
}