Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning

Mingjie Ma, Yichao Ma, Zhong Yang, Guohui Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 8727-8736

Abstract


Multimodal large language models (MLLMs) have achieved remarkable success on diverse vision-language tasks. However, fixed-resolution models struggle to perceive fine-grained visual details, largely because of *distracted attention* and *blurry vision*. To address these issues, we propose **SLoFo**, a training-free, self-guided inference framework that mimics the human "**S**can-**Lo**cate-**Fo**cus" process. SLoFo first adopts a dual-branch mechanism to identify critical image regions: the Semantic branch constructs a gradient-based semantic relevance map, and the Structure branch estimates visual token uniqueness, offering complementary and robust evidence. By combining both branches, SLoFo perceives and explicitly crops critical regions. During inference, given the additional cropped sub-image, SLoFo applies a progressive visual token pruning strategy that sharpens attention on key areas while reducing computational overhead. Experiments on detail-sensitive and general-purpose benchmarks show that SLoFo consistently improves accuracy (+4.79% on TextVQA, +2.58% on GQA) and robustness (+4.60% on POPE-MSCOCO adversarial) without training or external modules.
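The abstract does not give the fusion rule, the cropping criterion, or the pruning schedule, so the following is only a toy sketch of the Scan-Locate-Focus idea under stated assumptions, not the authors' implementation. It assumes the two branch outputs arrive as small 2D score grids, fuses them with a hypothetical convex combination (`alpha`), takes the bounding box of cells above a hypothetical `threshold` as the crop, and prunes tokens in stages using per-stage scores. All function names and parameters here are illustrative.

```python
from typing import List, Tuple

Grid = List[List[float]]


def normalize(grid: Grid) -> Grid:
    """Min-max scale a 2D score map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / span for v in row] for row in grid]


def fuse_maps(semantic: Grid, structure: Grid, alpha: float = 0.5) -> Grid:
    """Assumed fusion rule: convex combination of the two normalized maps."""
    s, u = normalize(semantic), normalize(structure)
    return [
        [alpha * a + (1 - alpha) * b for a, b in zip(row_s, row_u)]
        for row_s, row_u in zip(s, u)
    ]


def locate_box(fused: Grid, threshold: float = 0.6) -> Tuple[int, int, int, int]:
    """Inclusive (top, left, bottom, right) box around cells scoring >= threshold."""
    hits = [
        (r, c)
        for r, row in enumerate(fused)
        for c, v in enumerate(row)
        if v >= threshold
    ]
    if not hits:
        # Nothing stands out: fall back to the whole grid.
        return (0, 0, len(fused) - 1, len(fused[0]) - 1)
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


def progressive_prune(
    stage_scores: List[List[float]], keep_ratios: List[float]
) -> List[int]:
    """Drop low-scoring tokens stage by stage.

    stage_scores[k][i] is the (assumed) importance of original token i at
    stage k. Tokens pruned at an earlier stage are never reconsidered.
    Returns the surviving token indices in their original order.
    """
    kept = list(range(len(stage_scores[0])))
    for scores, ratio in zip(stage_scores, keep_ratios):
        n_keep = max(1, int(len(kept) * ratio))
        best = sorted(kept, key=lambda i: scores[i], reverse=True)[:n_keep]
        kept = sorted(best)
    return kept


# Toy inputs: a 3x3 patch grid and 8 visual tokens scored at two stages.
semantic = [[0.0, 0.1, 0.0], [0.2, 0.9, 0.8], [0.0, 0.1, 0.0]]
structure = [[0.1, 0.0, 0.1], [0.0, 1.0, 0.6], [0.1, 0.0, 0.0]]
print(locate_box(fuse_maps(semantic, structure)))  # (1, 1, 1, 2)

stage_scores = [
    [0.05, 0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6],
    [0.9, 0.1, 0.5, 0.2, 0.9, 0.3, 0.8, 0.4],
]
print(progressive_prune(stage_scores, [0.75, 0.5]))  # [4, 6, 7]
```

In the second example, token 0 scores highly at the second stage but is already gone after the first, which is what distinguishes staged pruning from a single top-k pass. A real system would derive the scores from model gradients and attention, which this sketch does not attempt.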

Related Material


@InProceedings{Ma_2026_CVPR,
    author    = {Ma, Mingjie and Ma, Yichao and Yang, Zhong and Li, Guohui},
    title     = {Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {8727-8736}
}