ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments

Jiaxin Ai, Yukang Feng, Fanrui Zhang, Jianwen Sun, Zizhen Li, Chuanhao Li, Yifan Chang, Wenxiao Wu, Ruoxi Wang, Mingliang Zhai, Kaipeng Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 34586-34595

Abstract


Multi-modal agents are making rapid progress on general computer-use tasks. However, existing benchmarks remain largely confined to web browsers and rudimentary applications, failing to capture the professional software workflows that dominate real-world scientific and industrial practices. To bridge this gap, we introduce ProSoftArena, a comprehensive benchmark and platform specifically designed for evaluating multi-modal agents in professional software environments. We establish the first five-level capability hierarchy for professional software manipulation, and curate a benchmark of 456 realistic tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable assessment, we build an executable real-computer environment with an execution-based evaluation framework, and uniquely incorporate a human-in-the-loop evaluation paradigm to quantify agents' collaborative efficiency. Extensive experiments show that even the best-performing agent achieves only a 20.6% success rate on software-level tasks (L2) and completely fails on multi-software workflows (L3). Our in-depth analysis further provides valuable insights into current agent limitations and suggests effective design principles for building more capable agents in professional software settings. This project is available at: https://prosoftarena.github.io.

Related Material


[bibtex]
@InProceedings{Ai_2026_CVPR,
    author    = {Ai, Jiaxin and Feng, Yukang and Zhang, Fanrui and Sun, Jianwen and Li, Zizhen and Li, Chuanhao and Chang, Yifan and Wu, Wenxiao and Wang, Ruoxi and Zhai, Mingliang and Zhang, Kaipeng},
    title     = {ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {34586-34595}
}