ReMoT: Reinforcement Learning with Motion Contrast Triplets

Cong Wan, Zeyu Guo, Jiangyang Li, Songlin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 5487-5498

Abstract


We present ReMoT, a unified training paradigm that systematically addresses the fundamental shortcomings of vision-language models (VLMs) in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (i) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, surpassing costly manual or model-based generation; and (ii) Group Relative Policy Optimization (GRPO), which we empirically show yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning (SFT). We also construct the first benchmark for fine-grained motion contrast triplets, which measures a VLM's ability to discriminate subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, culminating in a 25.1 performance leap on spatio-temporal reasoning tasks.
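For readers unfamiliar with Group Relative Policy Optimization, the sketch below illustrates its central step as commonly described: several responses are sampled for one prompt, each is scored, and each reward is normalized against the mean and standard deviation of its own group. This is not the authors' implementation. The exact-match reward `triplet_reward`, the example motion labels, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def triplet_reward(predicted: str, target: str) -> float:
    """Hypothetical rule-based reward: 1.0 for a matching motion label, else 0.0."""
    return 1.0 if predicted.strip().lower() == target.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own sampled group.

    Assumption: population standard deviation; some implementations use the
    sample standard deviation instead. `eps` avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled answers to a single (hypothetical) motion question
# whose correct label is "left".
group = ["left", "right", "left", "Left "]
rewards = [triplet_reward(answer, "left") for answer in group]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(group, advantages):
    print(f"{answer!r}: advantage {adv:+.3f}")
```

Responses that score above their group's average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to provide a baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.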

Related Material


@InProceedings{Wan_2026_CVPR,
    author    = {Wan, Cong and Guo, Zeyu and Li, Jiangyang and Dong, Songlin and Bai, Yifan and Peng, Lin and Ma, Zhiheng and Gong, Yihong},
    title     = {ReMoT: Reinforcement Learning with Motion Contrast Triplets},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {5487-5498}
}