Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Junxuan Li, Rawal Khirodkar, Egor Zakharov, Jihyun Lee, Zhaoen Su, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Zhongshi Jiang, Lingchen Yang, Ariyan Zarei, Marco Pesavento, Yichen Xu, Chengan He, He Wen, Giljoo Nam, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Rinat Abdrashitov, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 18204-18215

Abstract


High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but models trained on it struggle to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we introduce, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe that relightability and loose-garment support emerge and generalize to unconstrained inputs, along with zero-shot robustness to stylized imagery, despite the absence of direct supervision.

Related Material


@InProceedings{Li_2026_CVPR,
    author    = {Li, Junxuan and Khirodkar, Rawal and Zakharov, Egor and Lee, Jihyun and Su, Zhaoen and Dong, Yuan and Martinez, Julieta and Li, Kai and Tan, Qingyang and Shiratori, Takaaki and Hu, Matthew and Guo, Peihong and Huang, Xuhua and Jiang, Zhongshi and Yang, Lingchen and Zarei, Ariyan and Pesavento, Marco and Xu, Yichen and He, Chengan and Wen, He and Nam, Giljoo and Deng, Teng and Borsos, Wyatt and Thakrar, Anjali and Bazin, Jean-Charles and Abdrashitov, Rinat and Stoll, Carsten and Hidalgo, Gin\'es and Booth, James and Wang, Lucy and Ma, Xiaowen and Rong, Yu and Thalanki, Sairanjith and Cao, Chen and H\"ane, Christian and Kar, Abhishek and Bouaziz, Sofien and Saragih, Jason and Sheikh, Yaser and Saito, Shunsuke},
    title     = {Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {18204-18215}
}