MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning

Jiarui Sun, M. Ugur Akcal, Girish Chowdhary, Wei Zhang; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6719-6729

Abstract


In visual Reinforcement Learning (RL) learning from pixel-based observations poses significant challenges on sample efficiency primarily due to the complexity of extracting informative state representations from high-dimensional data. Previous methods such as contrastive-based approaches have made strides in improving sample efficiency but fall short in modeling the nuanced evolution of states. To address this we introduce MOOSS a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking to explicitly model state evolution in visual RL. Specifically we propose a self-supervised dual-component strategy that integrates (1) a graph construction of pixel-based observations for spatial-temporal masking coupled with (2) a multi-level contrastive learning mechanism that enriches state representations by emphasizing temporal continuity and change of states. MOOSS advances the understanding of state dynamics by disrupting and learning from spatial-temporal correlations which facilitates policy learning. Our comprehensive evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency demonstrating the effectiveness of our method.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Sun_2025_WACV, author = {Sun, Jiarui and Akcal, M. Ugur and Chowdhary, Girish and Zhang, Wei}, title = {MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning}, booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)}, month = {February}, year = {2025}, pages = {6719-6729} }