InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, Jiangmiao Pang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 976-985

Abstract


Recent work explores how real and synthetic data contribute to VLA model generalization. While the \pi-series model has shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest \pi-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprisingly strong zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables flexible task assembly, long-horizon skill composition, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as \pi_0, we pre-train a model entirely on InternData-A1 and find that it matches the official \pi_0 across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We will open-source both the dataset and the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.

Related Material


[bibtex]
@InProceedings{Tian_2026_CVPR,
    author    = {Tian, Yang and Yang, Yuyin and Xie, Yiman and Cai, Zetao and Shi, Xu and Gao, Ning and Liu, Hangxu and Jiang, Xuekun and Qiu, Zherui and Yuan, Feng and Li, Yaping and Wang, Ping and Cai, Junhao and Zeng, Jia and Dong, Hao and Pang, Jiangmiao},
    title     = {InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {976-985}
}