@InProceedings{Xiang_2026_CVPR,
    author    = {Xiang, Wei and Wu, Yexinrui and Chen, Xinli and Li, Xinran and Chen, Shi},
    title     = {UI-Lens: Assessing General MLLMs' Potential to Automate UI Display Quality Assurance},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {25882-25892}
}
UI-Lens: Assessing General MLLMs' Potential to Automate UI Display Quality Assurance
Abstract
User Interface (UI) display defect detection poses challenges far beyond UI understanding, requiring fine-grained element boundary understanding, missing-content detection, and reasoning about sequential interface semantic consistency. However, the capabilities of multimodal large language models (MLLMs) and vision-language models (VLMs) for detecting UI defects in realistic, complex interfaces have not been systematically validated. To fill this gap, we present UI-Lens, the first multi-dimensional UI display defect detection benchmark for Chinese-language UI scenarios. The dataset comprises 4,759 pages meticulously annotated by design experts, covering six core display defect categories. We conduct a systematic evaluation of 10 mainstream models (8 closed-source, 2 open-source). Results show clear shortcomings in current models: for tasks requiring fine-grained element boundary understanding, performance is near random, with task-average F1 scores of 20.36% and 31.21% on Text Overflow and Container Overlap, respectively; for sequential interface semantic consistency (e.g., Text Inconsistency), the task-average F1 score is only 10.61%, indicating severe underperformance. We release UI-Lens to catalyze research toward more robust UI display defect detection with fine-grained boundary awareness in realistic, complex interfaces.