# ZonUI-3B
ZonUI-3B is a lightweight GUI grounding model optimized for high-resolution screens, trained on just 24K examples using a single RTX 4090.

![Performance comparison with state-of-the-art models](assets/sota_perf_and_rader_compare.jpg)
<!-- ![Training Flow](assets/training_flow_solid.jpg) -->

## Guide
- Inference guide: ./inference/inference.ipynb
- Evaluation guide: ./EVALUATION.md (evaluating ScreenSpot and ScreenSpot-v2 takes about 30 minutes on an RTX 4090; ScreenSpot-Pro takes about 1 hour)
- Reproduction guide: ./TRAIN.md
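
For readers new to GUI grounding benchmarks, the sketch below illustrates the point-in-box criterion that ScreenSpot-style evaluations commonly use: a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. It is a minimal illustration, not the evaluation code from this repository; `Sample`, `is_hit`, and `accuracy_by_split` are hypothetical names, and the coordinate convention (`x1, y1, x2, y2` in the same units as the prediction) is an assumption. Follow ./EVALUATION.md for the actual procedure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """One benchmark item (hypothetical structure, for illustration only)."""
    split: str                                # e.g. "Mobile-Text" or "Web-Icon"
    bbox: Tuple[float, float, float, float]   # ground truth as (x1, y1, x2, y2)
    pred: Optional[Tuple[float, float]]       # predicted (x, y); None if the output could not be parsed


def is_hit(pred, bbox):
    """Return True when the predicted point lies inside the box (edges included)."""
    if pred is None:
        return False
    x, y = pred
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy_by_split(samples):
    """Compute the percentage of hits for each split."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in samples:
        totals[s.split] += 1
        hits[s.split] += int(is_hit(s.pred, s.bbox))
    return {split: 100.0 * hits[split] / totals[split] for split in totals}


if __name__ == "__main__":
    demo = [
        Sample("Mobile-Text", (10, 10, 50, 30), (20, 20)),  # inside the box
        Sample("Mobile-Text", (10, 10, 50, 30), (60, 20)),  # outside the box
        Sample("Web-Icon", (0, 0, 16, 16), None),           # unparsable output counts as a miss
    ]
    print(accuracy_by_split(demo))
```

Treating an unparsable prediction as a miss is a design choice in this sketch; it keeps the denominator equal to the number of benchmark items, so scores stay comparable across models.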

## 🖥️ Hardware
- GPU: 1 × RTX 4090 24GB
- Training time: <= 48 hours

## Main Results

### ScreenSpot

| Grounding Model          | Avg Score  | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon |
|--------------------------|--------|-------------|-------------|---------------|----------------|-----------|-----------|
| **General Models**       |        |             |             |               |                |           |           |
| Qwen2.5-VL-3B            | 55.5   | -           | -           | -             | -              | -         | -         |
| InternVL3-8B             | 79.5   | -           | -           | -             | -              | -         | -         |
| Claude3.5 Sonnet         | 83.0   | -           | -           | -             | -              | -         | -         |
| Gemini-2 Flash           | 84.0   | -           | -           | -             | -              | -         | -         |
| Qwen2.5-VL-7B            | 84.7   | -           | -           | -             | -              | -         | -         |
| **GUI-specific Models**  |        |             |             |               |                |           |           |
| CogAgent-18B             | 47.4   | 67.0        | 24.0        | 74.2          | 20.0           | 70.4      | 28.6      |
| SeeClick-9.6B            | 53.4   | 78.0        | 52.0        | 72.2          | 30.0           | 55.7      | 32.5      |
| OmniParser               | 73.0   | 93.9        | 57.0        | 91.3          | 63.6           | 81.3      | 51.0      |
| UGround-7B               | 73.3   | 82.8        | 60.3        | 82.5          | 63.6           | 80.4      | 70.4      |
| ShowUI-2B                | 75.0   | 91.6        | 69.0        | 81.8          | 59.0           | 83.0      | 65.5      |
| UI-TARS-2B               | 82.3   | 93.0        | 75.5        | 90.7          | 68.6           | 84.3      | 74.8      |
| OS-Atlas-7B              | 82.5   | 93.0        | 72.9        | 91.8          | 62.9           | 90.9      | 74.3      |
| Aguvis-7B                | 84.4   | 95.6        | 77.7        | 93.8          | 67.1           | 88.3      | 75.2      |
| **ZonUI-3B**          | **84.9** | **96.3**    | **81.6**    | **93.8**      | **74.2**       | 89.5      | 74.2      |


### ScreenSpot-v2

| Grounding Model          | Avg Score  | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon |
|--------------------------|--------|-------------|-------------|---------------|----------------|-----------|-----------|
| **General Models**       |        |             |             |               |                |           |           |
| InternVL3-8B             | 81.4   | -           | -           | -             | -              | -         | -         |
| **GUI-specific Models**  |        |             |             |               |                |           |           |
| SeeClick-9.6B            | 55.1   | 78.4        | 50.7        | 70.1          | 29.3           | 55.2      | 32.5      |
| UGround-7B               | 76.3   | 84.5        | 61.6        | 85.1          | 61.4           | 84.6      | 71.9      |
| ShowUI-2B                | 77.3   | 92.1        | 75.4        | 78.9          | 59.3           | 84.2      | 61.1      |
| OS-Atlas-7B              | 84.1   | 95.1        | 75.8        | 90.7          | 63.5           | 90.6      | 77.3      |
| UI-TARS-2B               | 84.7   | 95.2        | 79.1        | 90.7          | 68.6           | 87.2      | 78.3      |
| **ZonUI-3B**        | **86.4** | **97.9**    | **84.8**    | **93.8**      | **75.0**       | **91.0**  | 75.8      |

### ScreenSpot-Pro

| Agent Model             | Dev Avg | Creative Avg | CAD Avg | Scientific Avg | Office Avg | OS Avg | Overall Avg |
|-------------------------|---------|--------------|---------|----------------|------------|--------|--------------|
| QwenVL-7B               | 0.0     | 0.0          | 0.0     | 0.4            | 0.0        | 0.0    | 0.1          |
| GPT-4o                  | 0.7     | 0.6          | 1.5     | 1.2            | 0.9        | 0.0    | 0.8          |
| SeeClick                | 0.3     | 0.6          | 1.9     | 2.0            | 0.9        | 1.5    | 1.1          |
| Qwen2-VL-7B             | 1.3     | 0.9          | 0.4     | 3.5            | 3.0        | 0.5    | 1.6          |
| OS-Atlas-4B             | 3.7     | 2.3          | 1.5     | 7.5            | 4.8        | 3.1    | 3.7          |
| ShowUI-2B               | 9.4     | 5.3          | 1.9     | 10.6           | 13.5       | 6.6    | 7.7          |
| CogAgent-18B            | 8.0     | 5.6          | 6.1     | 13.4           | 10.0       | 3.1    | 7.7          |
| Aria-UI                 | 8.4     | 14.7         | 6.1     | 18.1           | 16.1       | 2.6    | 11.3         |
| UGround-7B              | 14.7    | 17.0         | 11.1    | 19.3           | 27.0       | 9.7    | 16.5         |
| Claude Computer Use     | 12.6    | 16.8         | 11.9    | 25.8           | 26.9       | 8.1    | 17.1         |
| OS-Atlas-7B             | 17.7    | 17.9         | 10.3    | 24.4           | 27.4       | 16.8   | 18.9         |
| UGround-V1-2B           | 27.4    | 26.7         | 14.6    | 34.3           | 38.3       | 17.9   | 26.6         |
| Qwen2.5-VL-7B           | 26.1    | 24.0         | 13.0    | 31.1           | 45.2       | 23.5   | 26.8         |
| UI-TARS-2B              | 26.4    | 27.6         | 14.6    | 39.8           | 42.6       | 14.3   | 27.7         |
| **ZonUI-3B (Ours)**     | 15.7    | 26.9         | 27.9    | 38.9           | 50.0       | 14.2   | **28.7**     |



