SUM: Saliency Unification through Mamba for Visual Attention Modeling

Alireza Hosseini, Amirhossein Kazerouni, Saeed Akhavan, Michael Brudno, Babak Taati; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1597-1607
Abstract
Visual attention modeling, important for interpreting and prioritizing visual stimuli, plays a significant role in applications such as marketing, multimedia, and robotics. Traditional saliency prediction models, especially those based on Convolutional Neural Networks (CNNs) or Transformers, achieve notable success by leveraging large-scale annotated datasets. However, the current state-of-the-art (SOTA) models that use Transformers are computationally expensive. Additionally, separate models are often required for each image type, lacking a unified approach. In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net to provide a unified model for diverse image types. Using a novel Conditional Visual State Space (C-VSS) block, SUM dynamically adapts to various image types, including natural scenes, web pages, and commercial imagery, ensuring universal applicability across different data types. Our comprehensive evaluations across five benchmarks demonstrate that SUM seamlessly adapts to different visual characteristics and consistently outperforms existing models. These results position SUM as a versatile and powerful tool for advancing visual attention modeling, offering a robust solution universally applicable across different types of visual content. Our codebase and pretrained models are publicly available at https://arhosseini77.github.io/sum_page/.
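The abstract describes a block that conditions shared features on the image type (natural scene, web page, commercial imagery). The paper's actual C-VSS implementation is not reproduced here; the sketch below is only a minimal, hypothetical illustration of that conditioning idea, using a learned per-type scale/shift (FiLM-style) modulation. All names (ConditionalBlock, mixer, num_conditions) are assumptions, and the depthwise convolution is merely a stand-in for the Mamba selective-scan token mixer.

```python
# Hypothetical sketch of type-conditioned feature modulation, in the spirit of
# the C-VSS idea described in the abstract. NOT the authors' implementation.
import torch
import torch.nn as nn

class ConditionalBlock(nn.Module):
    def __init__(self, dim: int, num_conditions: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Placeholder token mixer; the real block would use a Mamba-style
        # selective scan instead of this depthwise convolution.
        self.mixer = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # One learned (scale, shift) pair per image type: the "conditional" part.
        self.cond = nn.Embedding(num_conditions, 2 * dim)

    def forward(self, x: torch.Tensor, cond_id: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) channels-last features; cond_id: (B,) image-type ids.
        scale, shift = self.cond(cond_id).chunk(2, dim=-1)  # each (B, C)
        h = self.norm(x) * (1 + scale[:, None, None, :]) + shift[:, None, None, :]
        h = self.mixer(h.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return x + h  # residual connection

# Usage:
#   feats = torch.randn(2, 32, 32, 64)      # (B, H, W, C) feature map
#   ids = torch.tensor([0, 2])              # e.g., 0 = natural scene, 2 = web page
#   out = ConditionalBlock(64)(feats, ids)  # same shape as feats
```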
Related Material
[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Hosseini_2025_WACV,
  author    = {Hosseini, Alireza and Kazerouni, Amirhossein and Akhavan, Saeed and Brudno, Michael and Taati, Babak},
  title     = {SUM: Saliency Unification through Mamba for Visual Attention Modeling},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {1597-1607}
}