@InProceedings{Jia_2026_CVPR,
    author    = {Jia, Sen and Wang, Huayu and Huang, Hsiang-Wei and An, Zhaochong and Hwang, Jenq-Neng and Zhang, Huaping and Li, Lei},
    title     = {CLEP: Contrastive Language-Pose Pretraining},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {30696-30706}
}
CLEP: Contrastive Language-Pose Pretraining
Abstract
Aligning natural language descriptions with precise 3D human poses remains a significant challenge because effective pose representation mechanisms and large-scale, semantically rich datasets are both scarce. To overcome these limitations, we first introduce **CLEP-2M**, the largest 3D pose-language dataset to date, comprising two million high-quality 3D pose-language pairs. This dataset provides a **20-fold** increase in scale and far richer semantic diversity than existing benchmarks. Second, we propose **CLEP**, a novel contrastive pretraining framework. The core of CLEP is HierFormer, a hierarchical pose encoder specifically designed for language alignment. Its key innovation is a Cross-Scale Attention Fusion (CSAF) mechanism that dynamically integrates features from the joint, limb, and body levels. This enables CLEP to precisely align complex, multi-scale text descriptions with the pose representation. Extensive experimental evaluations on CLEP-2M and PoseScript demonstrate that our method consistently outperforms existing approaches across a range of downstream tasks. CLEP shows exceptional zero-shot generalization, achieving 34.8 mRecall on the human-annotated PoseScript-H benchmark, a nearly **6-fold** improvement over the baseline. Furthermore, CLEP demonstrates superior performance on pose generation and fine-grained pose editing. These results establish CLEP as a strong multimodal foundation model for human-centric understanding and generation tasks.

