@InProceedings{Jamil_2025_CVPR,
  author    = {Jamil, Sonain},
  title     = {PoseSynViT: Lightweight and Scalable Vision Transformers for Human Pose Estimation},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {3912-3921}
}
PoseSynViT: Lightweight and Scalable Vision Transformers for Human Pose Estimation
Abstract
Vision transformers (ViTs) have consistently delivered outstanding results in visual recognition tasks without requiring specialized domain knowledge. Nevertheless, their application to human pose estimation (HPE) remains underexplored. This paper introduces PoseSynViT, a new lightweight ViT model that surpasses ViTPose in several areas, including simplicity of model architecture, scalability, training versatility, and ease of knowledge transfer. Our model uses ViTs as backbones to extract features for HPE and couples them with a lightweight decoder. It scales efficiently from 10M to 1B parameters, taking advantage of the inherent scalability and high parallelism of transformers, and sets a new benchmark for the trade-off between throughput and performance. PoseSynViT is highly adaptable, supporting various attention mechanisms, input resolutions, and training approaches, and can handle multiple HPE tasks. Additionally, we demonstrate that knowledge from larger models can be seamlessly transferred to smaller ones through a straightforward knowledge token. Experimental results on the MS COCO benchmark show that PoseSynViT outperforms current methods, with our largest model achieving a new state-of-the-art result of 84.3 AP on the MS COCO test set.
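As a rough illustration of the pipeline the abstract describes (a plain ViT backbone feeding a lightweight decoder that regresses per-keypoint heatmaps, plus an extra token into which a larger teacher's knowledge could be copied), the PyTorch sketch below may help. All module names, layer sizes, and the knowledge-token handling are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of a PoseSynViT-style pipeline: a plain ViT backbone extracts
# patch features and a lightweight decoder upsamples them into per-keypoint
# heatmaps. The "knowledge_token" parameter is a hypothetical stand-in for the
# abstract's knowledge token; sizes and layer choices are assumptions.
import torch
import torch.nn as nn


class LightweightDecoder(nn.Module):
    """Two deconvolution stages followed by a 1x1 head, a common simple decoder."""

    def __init__(self, embed_dim: int, num_keypoints: int = 17):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(256, num_keypoints, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(self.deconv(feats))


class PoseViTSketch(nn.Module):
    """Plain ViT encoder plus lightweight decoder; an extra learnable token is
    prepended so a larger teacher's token could be transferred into a student."""

    def __init__(self, img_size=(256, 192), patch=16, embed_dim=384, depth=6, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.grid = (img_size[0] // patch, img_size[1] // patch)
        num_patches = self.grid[0] * self.grid[1]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.knowledge_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, heads, 4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = LightweightDecoder(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)           # (B, N, C)
        tokens = torch.cat([self.knowledge_token.expand(b, -1, -1), tokens], dim=1)
        tokens = self.encoder(tokens + self.pos_embed)
        feats = tokens[:, 1:].transpose(1, 2).reshape(b, -1, *self.grid)  # drop token, back to a map
        return self.decoder(feats)                                        # (B, 17, H/4, W/4)


if __name__ == "__main__":
    model = PoseViTSketch()
    heatmaps = model(torch.randn(1, 3, 256, 192))
    print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])

On a dummy 256x192 person crop this produces 17 heatmaps at a quarter of the input resolution, matching the usual top-down COCO keypoint setup; the actual PoseSynViT decoder and transfer mechanism are described in the paper itself.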