SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Ni, Chaojun; Chen, Cheng; Wang, Xiaofeng; Zhu, Zheng; Zheng, Wenzhao; Wang, Boyuan; Chen, Tianrun; Zhao, Guosheng; Li, Haoyun; Dong, Zhehao; Zhang, Qiang; Ye, Yun; Wang, Yang; Huang, Guan; Mei, Wenjun

Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 13474-13485

Abstract

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that randomly masks 4D inputs to the VLM and trains the VLA to reconstruct the masked features. This self-reconstruction objective helps learn effective 4D representations, allowing the 4D branch to be dropped at inference with minimal performance loss. Extensive experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7× larger. On edge devices, SwiftVLA achieves comparable performance while being 18× faster than $\pi_0$ and reducing the memory footprint by 12×.
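
The mask-and-reconstruct idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the masking ratio, the mask value, or the loss, so the zero-vector mask, the 0.5 ratio, the mean-squared-error loss, and the function names `mask_features` and `reconstruction_loss` are all assumptions made for illustration.

```python
import random


def mask_features(features, mask_ratio, rng):
    """Randomly replace a fraction of feature vectors with zeros.

    Assumption: masking is done by zeroing whole tokens. Returns the
    masked copy and the sorted indices of the masked tokens.
    """
    n = len(features)
    num_masked = round(n * mask_ratio)
    masked_idx = sorted(rng.sample(range(n), num_masked))
    masked_set = set(masked_idx)
    masked = [
        [0.0] * len(vec) if i in masked_set else list(vec)
        for i, vec in enumerate(features)
    ]
    return masked, masked_idx


def reconstruction_loss(predicted, target, masked_idx):
    """Mean squared error computed only over the masked positions.

    Assumption: the reconstruction objective is an MSE on masked tokens.
    """
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Eight toy "4D feature" tokens of dimension 2 (hypothetical data).
features_4d = [[float(i), float(i) + 0.5] for i in range(8)]
masked, idx = mask_features(features_4d, mask_ratio=0.5, rng=rng)

# Loss when the model simply echoes the masked input (no reconstruction).
loss_if_unreconstructed = reconstruction_loss(masked, features_4d, idx)
# Loss when the model recovers the original features exactly.
loss_if_perfect = reconstruction_loss(features_4d, features_4d, idx)
```

In this sketch the loss is positive when the masked tokens are left as zeros and falls to zero when they are recovered exactly, which is the signal a reconstruction objective of this kind would train against.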

Related Material

[bibtex]

@InProceedings{Ni_2026_CVPR,
    author    = {Ni, Chaojun and Chen, Cheng and Wang, Xiaofeng and Zhu, Zheng and Zheng, Wenzhao and Wang, Boyuan and Chen, Tianrun and Zhao, Guosheng and Li, Haoyun and Dong, Zhehao and Zhang, Qiang and Ye, Yun and Wang, Yang and Huang, Guan and Mei, Wenjun},
    title     = {SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {13474-13485}
}