@InProceedings{Li_2026_CVPR,
    author    = {Li, Junxuan and Khirodkar, Rawal and Zakharov, Egor and Lee, Jihyun and Su, Zhaoen and Dong, Yuan and Martinez, Julieta and Li, Kai and Tan, Qingyang and Shiratori, Takaaki and Hu, Matthew and Guo, Peihong and Huang, Xuhua and Jiang, Zhongshi and Yang, Lingchen and Zarei, Ariyan and Pesavento, Marco and Xu, Yichen and He, Chengan and Wen, He and Nam, Giljoo and Deng, Teng and Borsos, Wyatt and Thakrar, Anjali and Bazin, Jean-Charles and Abdrashitov, Rinat and Stoll, Carsten and Hidalgo, Gin\'es and Booth, James and Wang, Lucy and Ma, Xiaowen and Rong, Yu and Thalanki, Sairanjith and Cao, Chen and H\"ane, Christian and Kar, Abhishek and Bouaziz, Sofien and Saragih, Jason and Sheikh, Yaser and Saito, Shunsuke},
    title     = {Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {18204-18215}
}
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Abstract
High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but models trained on it struggle to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we introduce, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization of relightability and loose-garment support to unconstrained inputs, as well as zero-shot robustness to stylized imagery, despite the absence of direct supervision.

