@InProceedings{Deb_2024_CVPR,
    author    = {Deb, Tonmoay and Wang, Lichen and Bessinger, Zachary and Khosravan, Naji and Penner, Eric and Kang, Sing Bing},
    title     = {ZInD-Tell: Towards Translating Indoor Panoramas into Descriptions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2050-2059}
}
ZInD-Tell: Towards Translating Indoor Panoramas into Descriptions
Abstract
This paper focuses on bridging the gap between natural language descriptions, 360-degree panoramas, room shapes, and layouts/floorplans of indoor spaces. To enable new multi-modal (image, geometry, language) research directions in indoor environment understanding, we propose a novel extension to the Zillow Indoor Dataset (ZInD), which we call ZInD-Tell. We first introduce an effective technique for extracting geometric information from ZInD's raw structural data, which facilitates the generation of accurate ground-truth descriptions using GPT-4. A human-in-the-loop approach is then employed to ensure the quality of these descriptions. To demonstrate the potential of our dataset, we introduce the ZInD-Tell benchmark, focusing on two exemplary tasks: language-based home retrieval and indoor description generation. Furthermore, we propose an end-to-end, zero-shot baseline model, ZInD-Agent, designed to process an unordered set of panorama images and generate home descriptions. ZInD-Agent outperforms naive methods on both tasks and can therefore be considered a complement to the naive baselines, showing the potential use of the data and the impact of geometry. We believe this work initiates new trajectories in leveraging computer vision techniques to analyze indoor panorama images descriptively by learning the latent relation between the vision, geometry, and language modalities.