@InProceedings{Ma_2026_CVPR,
    author    = {Ma, Mingjie and Ma, Yichao and Yang, Zhong and Li, Guohui},
    title     = {Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {8727-8736}
}
Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success on diverse vision-language tasks. However, fixed-resolution models struggle to perceive fine-grained visual details, chiefly because of *distracted attention* and *blurry vision*. To address these issues, we propose **SLoFo**, a training-free, self-guided inference framework that mimics the human "**S**can-**Lo**cate-**Fo**cus" process. SLoFo first adopts a dual-branch mechanism to identify critical image regions: the Semantic branch constructs a gradient-based semantic relevance map, and the Structure branch estimates visual token uniqueness, offering complementary and robust evidence. By combining both branches, SLoFo perceives and explicitly crops critical regions. During inference, with the additional cropped sub-image as input, SLoFo applies a progressive visual token pruning strategy that sharpens attention on key areas while reducing computational overhead. Experiments on detail-sensitive and general-purpose benchmarks show that SLoFo consistently improves accuracy (+4.79% on TextVQA, +2.58% on GQA) and robustness (+4.60% on POPE-MSCOCO adversarial) without training or external modules.
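As a rough illustration of the locate-and-focus idea described in the abstract, the following minimal sketch fuses two patch-level maps, crops the highest-scoring region, and prunes tokens in stages. It is not the authors' implementation: the abstract does not specify the fusion rule, the cropping criterion, or the pruning schedule, so the weighted sum, the `alpha`, `keep`, and `keep_ratios` parameters, and all function names here are assumptions made for illustration only.

```python
import numpy as np

def normalize(m):
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def locate_region(semantic_map, structure_map, alpha=0.5, keep=0.8):
    """Fuse two patch-level maps (hypothetical weighted sum) and return a
    bounding box (top, left, bottom, right) in patch units around every
    patch whose fused score is at least `keep` times the peak score."""
    fused = alpha * normalize(semantic_map) + (1 - alpha) * normalize(structure_map)
    rows, cols = np.where(fused >= keep * fused.max())
    return (int(rows.min()), int(cols.min()),
            int(rows.max()) + 1, int(cols.max()) + 1)

def crop(image, box, patch):
    """Cut the pixel region covered by a patch-unit box out of the image."""
    top, left, bottom, right = box
    return image[top * patch:bottom * patch, left * patch:right * patch]

def prune_tokens(scores, keep_ratios=(0.75, 0.5)):
    """Assumed progressive schedule: at each stage keep the top fraction of
    the tokens that survived the previous stage. Returns sorted indices."""
    scores = np.asarray(scores, dtype=float)
    kept = np.arange(len(scores))
    for ratio in keep_ratios:
        n = max(1, int(round(len(kept) * ratio)))
        top = np.argsort(scores[kept])[::-1][:n]  # highest scores first
        kept = np.sort(kept[top])
    return kept

# Toy example: a 4x4 patch grid over an 8x8 image (2x2 pixels per patch).
semantic = np.zeros((4, 4)); semantic[1, 2] = 1.0
structure = np.zeros((4, 4)); structure[1, 2] = 1.0; structure[2, 2] = 0.9
image = np.arange(64).reshape(8, 8)

box = locate_region(semantic, structure)
sub_image = crop(image, box, patch=2)
kept = prune_tokens([0.1, 0.9, 0.3, 0.8, 0.5, 0.2, 0.7, 0.4])
```

In this toy setup the sub-image would be passed to the model alongside the original image, and `kept` would index the visual tokens retained after the final pruning stage. A real system would derive the semantic map from model gradients and the structure map from token features rather than from hand-written arrays.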