-
[pdf]
[supp]
[bibtex]@InProceedings{Tian_2025_CVPR, author = {Tian, Jirui and Zhang, Jinrong and Liu, Shenglan and Xu, Luhao and Huang, Zhixiong and Huang, Gao}, title = {DTOS: Dynamic Time Object Sensing with Large Multimodal Model}, booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)}, month = {June}, year = {2025}, pages = {13810-13820} }
DTOS: Dynamic Time Object Sensing with Large Multimodal Model
Abstract
Existing multimodal large language models (MLLMs) face significant challenges in Referring Video Object Segmentation(RVOS). We identify three critical challenges: (C1) insufficient quantitative representation of textual numerical data, (C2) repetitive and degraded response templates for spatiotemporal referencing, and (C3) loss of visual information in video sampling queries lacking textual guidance. To address these, we propose a novel framework, Dynamic Time Object Sensing (DTOS), specifically designed for RVOS. To tackle (C1) and (C2), we introduce specialized tokens to construct multi-answer response templates, enabling regression of event boundaries and target localization. This approach improves the accuracy of numerical regression while mitigating the issue of repetitive degradation. To address (C3), we propose a Text-guided Clip Sampler (TCS) that selects video clips aligned with user instructions, preventing visual information loss and ensuring consistent temporal resolution. TCS is also applicable to Moment Retrieval tasks, with enhanced multimodal input sequences preserving spatial details and maximizing temporal resolution. DTOS demonstrates exceptional capability in flexibly localizing multiple spatiotemporal targets based on user-provided textual instructions. Extensive experiments validate the effectiveness of our approach, with DTOS achieving state-of-the-art performance in J&F scores: an improvement of +4.36 on MeViS, +4.48 on Ref-DAVIS17, and +3.02 on Ref-YT-VOS. Additionally, our TCS demonstrates exceptional performance in Moment Retrieval. The code is available at https://github.com/Maulog/OPEN-DTOS-LMM
Related Material