@InProceedings{Wu_2026_CVPR,
    author    = {Wu, Rui and Zhang, Shuo and Tang, Xiaoxuan and Zhang, Ruirui and Liu, Yi and Jiang, Tao and Xu, Wenhao and Li, Yong},
    title     = {ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {14990-14999}
}
ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
Abstract
Multimodal web search agents offer a practically valuable capability: they fuse information from diverse modalities (e.g., text and vision), retrieved iteratively from the internet, to address complex user queries. However, the visual modality is prone to information overload, and the noise it contains, such as irrelevant background details or complex structures, can disrupt the model's attention and steer its operational focus down an erroneous path. To address this challenge, we propose ReFAct (Reasoning, Focusing, and Acting), a novel framework that empowers the agent to actively manage its cross-modal context. The agent can thus adjust its operational focus, mitigating the impact of noise on multimodal web search agents. Specifically, ReFAct employs a Grounding tool for active visual perception to dynamically filter information. We also design external memory-based Defocus/Refocus operations for selective information retention, which further modulate information density within the multimodal context so that the agent maintains focus during problem-solving. To evaluate and enhance agent capabilities in complex and noisy multimodal contexts, we propose a pipeline for constructing datasets with flexible complexity and introduce a new open-source benchmark, GroundedVQA. Finally, we experimentally demonstrate the effectiveness of our method on GroundedVQA and other widely used benchmarks.

