R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Zhang, Zirui; Dong, Haoyu; Pei, Kexin; Mao, Chengzhi

Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36893-36903

Abstract

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual versus textual representations of the same input. Rather than masking these failures with standard voting mechanisms, which amplify systematic biases, we demonstrate that cross-modal inconsistency provides a rich, natural signal for learning. We introduce R-C2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer via forward inference, we establish a dense, label-free reward. This cyclic constraint forces the model to autonomously align its representations. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not just from scaling data, but from enforcing a structurally consistent understanding of the world.
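The cycle described in the abstract (backward inference, modality switch, forward inference, answer reconstruction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cycle_consistency_reward`, the three callables, the binary exact-match reward, and the toy stand-ins are all assumptions made for illustration, and the abstract does not specify how the reward is actually computed.

```python
from typing import Callable


def cycle_consistency_reward(
    answer: str,
    backward: Callable[[str], str],         # hypothetical: answer -> query
    switch_modality: Callable[[str], str],  # hypothetical: re-express the query in the other modality
    forward: Callable[[str], str],          # hypothetical: query (other modality) -> answer
) -> float:
    """Return 1.0 if the answer survives the full cycle, else 0.0.

    Assumption: a binary exact-match reward; no ground-truth label is used,
    only agreement between the starting answer and the reconstructed one.
    """
    query = backward(answer)             # backward inference
    other_view = switch_modality(query)  # switch modalities
    reconstructed = forward(other_view)  # forward inference
    same = reconstructed.strip().lower() == answer.strip().lower()
    return 1.0 if same else 0.0


# Toy stand-ins for a model, used only to exercise the function.
def toy_backward(answer: str) -> str:
    return f"What is {int(answer) - 1} + 1?"


def toy_switch(query: str) -> str:
    # Pretend to render the text query as an image.
    return f"<image of: {query}>"


def toy_forward(view: str) -> str:
    text = view.replace("<image of: ", "").replace(">", "")
    left = text.replace("What is ", "").replace(" + 1?", "")
    return str(int(left) + 1)


consistent = cycle_consistency_reward("7", toy_backward, toy_switch, toy_forward)
inconsistent = cycle_consistency_reward("7", toy_backward, toy_switch, lambda view: "0")
```

In this toy setup a forward pass that agrees with the starting answer earns the reward, while one that contradicts it earns nothing, which is the sense in which the signal is label-free.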

Related Material


[bibtex]

@InProceedings{Zhang_2026_CVPR,
    author    = {Zhang, Zirui and Dong, Haoyu and Pei, Kexin and Mao, Chengzhi},
    title     = {R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {36893-36903}
}