@InProceedings{Xiu_2026_CVPR,
    author    = {Xiu, Jingqiao and Wang, Can and Xu, Dong},
    title     = {MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {33301-33311}
}
MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
Abstract
3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual rotary positional encoding space, while augmenting it with a jointly trained and sampled 3DGS decoder; and (3) a surrogate task that enhances feed-forward editing capabilities. Extensive experiments demonstrate that MLLMSplat delivers state-of-the-art performance across 3DGS understanding, generation, and editing.