Aligning Text, Images and 3D Structure Token-by-Token

Sahoo, Aadarsh; Tibrewal, Vansh; Gkioxari, Georgia

Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 14905-14914

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects so that they can be incorporated into our structured 3D scene modality. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, both synthetic and real-world. We demonstrate our model's effectiveness in reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/

Related Material

@InProceedings{Sahoo_2026_CVPR,
    author    = {Sahoo, Aadarsh and Tibrewal, Vansh and Gkioxari, Georgia},
    title     = {Aligning Text, Images and 3D Structure Token-by-Token},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {14905-14914}
}