FastVLM: Efficient Vision Encoding for Vision Language Models
Abstract
Vision Language Models (VLMs) like LLaVA encode images into tokens aligned to the word embedding space of the LLM decoder. Scaling input image resolution is essential for improving performance, especially in text-rich image understanding tasks. However, popular visual encoders such as CLIP-pretrained ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. In this work, we introduce FastVLM, which achieves an optimized trade-off between resolution, latency, and accuracy by incorporating FastViTHD, a new hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time while processing high-resolution images. We provide a comprehensive efficiency analysis of the interplay between image resolution, vision latency, number of visual tokens, and LLM size. In the LLaVA-1.5 setup, we achieve a 3.2x improvement in overall time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. On text-rich evaluations like TextVQA and DocVQA, FastVLM obtains +8.4% and +12.5% better accuracy than ConvLLaVA at a similar operating point of 144 visual tokens. Compared to LLaVA-OneVision at the highest resolution (1152 x 1152), FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same LLM, but with 85x faster TTFT, 3x less vision instruction tuning data, and a vision encoder that is 3.4x smaller.
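The abstract frames VLM efficiency as a trade-off between vision-encoding latency and the number of visual tokens the LLM must prefill before emitting its first token. The following is a minimal back-of-the-envelope sketch of that TTFT decomposition; the function ttft_ms and all latency numbers are hypothetical placeholders for illustration, not measurements from the paper.

# Minimal sketch of the TTFT decomposition discussed in the abstract:
# time-to-first-token = vision encoding latency + LLM prefill over visual tokens.
# All numbers below are hypothetical placeholders, not measurements from the paper.

def ttft_ms(vision_latency_ms: float, num_visual_tokens: int,
            prefill_ms_per_token: float) -> float:
    """TTFT = encoder time + prefill time for the visual tokens."""
    return vision_latency_ms + num_visual_tokens * prefill_ms_per_token

# Two illustrative operating points: a slower encoder emitting many tokens
# vs. a faster encoder emitting fewer tokens (the FastVLM design goal).
baseline = ttft_ms(vision_latency_ms=300.0, num_visual_tokens=576,
                   prefill_ms_per_token=0.5)
reduced = ttft_ms(vision_latency_ms=60.0, num_visual_tokens=144,
                  prefill_ms_per_token=0.5)
print(f"baseline TTFT ~ {baseline:.0f} ms, reduced-token TTFT ~ {reduced:.0f} ms")

Cutting either term helps, but they interact: raising input resolution improves accuracy on text-rich tasks while inflating both encoder latency and token count, which is why the paper optimizes the encoder along both axes at once.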
Related Material

[pdf] [supp] [arXiv]

[bibtex]
@InProceedings{Vasu_2025_CVPR,
    author    = {Vasu, Pavan Kumar Anasosalu and Faghri, Fartash and Li, Chun-Liang and Koc, Cem and True, Nate and Antony, Albert and Santhanam, Gokula and Gabriel, James and Grasch, Peter and Tuzel, Oncel and Pouransari, Hadi},
    title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {19769-19780}
}