MUFASA: A Multi-Layer Framework for Slot Attention

Bock, Sebastian; Schüßler, Leonie; Singh, Krishnakant; Schaub-Meyer, Simone; Roth, Stefan

Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 27750-27760

Abstract

Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot-attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
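To make the multi-layer idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It runs a simplified slot attention (no learned projections, GRU, or MLP) independently on several feature maps and then merges the results. The abstract does not specify the fusion strategy, so the plain averaging used here is a placeholder assumption, as are the names `slot_attention` and `multi_layer_slots`, the random stand-in features, and all shapes.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned projections, GRU, or MLP).

    features: (N, D) patch tokens; slots: (K, D) initial slots.
    Returns updated slots (K, D) and attention masks (K, N).
    """
    d = features.shape[1]
    for _ in range(n_iters):
        logits = slots @ features.T / np.sqrt(d)                  # (K, N)
        attn = softmax(logits, axis=0)                            # slots compete for each token
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)  # normalize per slot
        slots = weights @ features                                # weighted mean of tokens
    return slots, attn

def multi_layer_slots(layer_features, init_slots):
    """Run slot attention per layer, then fuse by averaging.

    Averaging is a placeholder, not the fusion strategy proposed in the paper.
    Sharing init_slots across layers is a crude way to keep slot indices aligned.
    """
    per_layer = [slot_attention(f, init_slots) for f in layer_features]
    fused_slots = np.mean([s for s, _ in per_layer], axis=0)  # (K, D)
    fused_attn = np.mean([a for _, a in per_layer], axis=0)   # (K, N)
    return fused_slots, fused_attn

rng = np.random.default_rng(0)
# Random stand-ins for token features taken from three ViT layers (196 tokens, 64 dims).
layer_features = [rng.normal(size=(196, 64)) for _ in range(3)]
init_slots = rng.normal(size=(5, 64))

fused_slots, fused_attn = multi_layer_slots(layer_features, init_slots)
segmentation = fused_attn.argmax(axis=0)  # per-token slot assignment
```

In this sketch each token's attention over slots sums to one in every layer, so the averaged masks remain a valid soft assignment and `argmax` over the slot axis yields a segmentation map. A real system would use learned projections and a learned fusion module in place of the mean.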

Related Material

@InProceedings{Bock_2026_CVPR,
    author    = {Bock, Sebastian and Sch\"u{\ss}ler, Leonie and Singh, Krishnakant and Schaub-Meyer, Simone and Roth, Stefan},
    title     = {MUFASA: A Multi-Layer Framework for Slot Attention},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {27750-27760}
}