@InProceedings{Huang_2026_CVPR,
    author    = {Huang, Zeyi and Ji, Yuyang and Rajan, Anirudh Sundara and Cai, Zefan and Xiao, Wen and Wang, Haohan and Hu, Junjie and Lee, Yong Jae},
    title     = {Learning to Select Visual Tools from Experience},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {4783-4793}
}
Learning to Select Visual Tools from Experience
Abstract
We introduce VisualToolAgent (VisTA), a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and compose tools from a diverse library based on empirical performance. Existing methods for tool-augmented visual reasoning either rely on training-free prompting or large-scale supervised fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, guided solely by task outcomes. Leveraging reinforcement learning with verifiable rewards (RLVR), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, MathVerse, and BlindTest benchmarks demonstrate that VisTA achieves significant performance gains over training-free and fine-tuning baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.

