CogAgent: A Visual Language Model for GUI Agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14281-14290

Abstract


People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks, Mind2Web and AITW, advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
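
To make the dual-encoder idea in the abstract concrete, below is a minimal PyTorch sketch of one plausible design: a low-resolution branch that captures global page layout and a high-resolution branch that preserves tiny elements and text, fused via cross-attention. This is an illustration only, not the authors' implementation (the actual architecture is described in the full paper); every module name, patch size, and dimension below is a hypothetical choice.

import torch
import torch.nn as nn

class DualResolutionEncoder(nn.Module):
    # Toy dual-branch encoder. The low-res branch sees a downsampled view
    # for global layout; the high-res branch patchifies the full 1120x1120
    # screenshot with large patches so its token count stays manageable.
    # All sizes here are illustrative, not CogAgent's actual configuration.
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Low-res branch: 224x224 view split into 16x16 patches -> 196 tokens.
        self.low_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # High-res branch: 1120x1120 view split into 80x80 patches -> 196 tokens.
        self.high_patch = nn.Conv2d(3, dim, kernel_size=80, stride=80)
        # Cross-attention: low-res tokens query high-res tokens, injecting
        # fine-grained detail into the compact global representation.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image):
        # image: (B, 3, 1120, 1120) full-resolution screenshot
        low = nn.functional.interpolate(image, size=224, mode="bilinear")
        low_tokens = self.low_patch(low).flatten(2).transpose(1, 2)      # (B, 196, dim)
        high_tokens = self.high_patch(image).flatten(2).transpose(1, 2)  # (B, 196, dim)
        fused, _ = self.cross_attn(low_tokens, high_tokens, high_tokens)
        return self.norm(low_tokens + fused)  # tokens fed to the language model

if __name__ == "__main__":
    screenshot = torch.randn(1, 3, 1120, 1120)
    tokens = DualResolutionEncoder()(screenshot)
    print(tokens.shape)  # torch.Size([1, 196, 256])

The point this sketch illustrates is the trade-off the abstract alludes to: the expensive global pathway only ever processes a small number of tokens, while fine detail from the full-resolution screenshot enters through a lighter side branch instead of blowing up the sequence length.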

Related Material


[bibtex]
@InProceedings{Hong_2024_CVPR,
    author    = {Hong, Wenyi and Wang, Weihan and Lv, Qingsong and Xu, Jiazheng and Yu, Wenmeng and Ji, Junhui and Wang, Yan and Wang, Zihan and Dong, Yuxiao and Ding, Ming and Tang, Jie},
    title     = {CogAgent: A Visual Language Model for GUI Agents},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14281-14290}
}