ChatHuman: Chatting about 3D Humans with Tools

Abstract

Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, emotion, and more. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret their results. To address this, we introduce ChatHuman, a language-driven system that integrates the skills of these specialized methods. ChatHuman functions as an assistant proficient in using, analyzing, and combining tools for 3D human tasks, and in discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Adapting LLMs to 3D human tasks presents challenges, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovative features of ChatHuman include leveraging academic publications to instruct the LLM in how to use the tools, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating among and integrating tool results, transforming specialized 3D outputs into comprehensible formats to enhance tasks related to 3D humans. Our experiments demonstrate that ChatHuman surpasses existing models in both tool-selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.

Related Material

[pdf] [supp] [arXiv]

[bibtex]
@InProceedings{Lin_2025_CVPR,
  author    = {Lin, Jing and Feng, Yao and Liu, Weiyang and Black, Michael J.},
  title     = {ChatHuman: Chatting about 3D Humans with Tools},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {8150-8161}
}
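The abstract describes tool selection driven by an LLM, with retrieval-augmented generation supplying in-context examples for new tools. A minimal sketch of that loop is shown below. Everything here is illustrative, not the paper's implementation: the tool registry, the example store, the bag-of-words "embedding", and the score-based `select_tool` are hypothetical stand-ins for the finetuned LLM and learned retriever.

```python
# Hypothetical sketch of ChatHuman-style tool selection with retrieval-augmented
# in-context examples. All names and the toy scoring are illustrative; the real
# system uses an LLM instructed via tool descriptions drawn from publications.
from collections import Counter
from math import sqrt

# A small registry of 3D-human tools with natural-language descriptions.
TOOLS = {
    "pose_estimator": "estimate 3d human pose and shape from an image",
    "contact_detector": "detect human scene and self contact regions",
    "emotion_recognizer": "recognize emotion from facial expression",
}

# Past (query -> tool) pairs, retrieved as in-context examples for new queries.
EXAMPLES = [
    ("what is the 3d pose of the person", "pose_estimator"),
    ("which body parts touch the chair", "contact_detector"),
    ("how does this person feel", "emotion_recognizer"),
]

def embed(text):
    """Toy bag-of-words embedding (stand-in for a learned text encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query, k=2):
    """Retrieval-augmented step: pick the k past examples most similar to the query."""
    q = embed(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]

def select_tool(query):
    """Stand-in for the LLM call: score each tool description against the query,
    let retrieved in-context examples vote for their tools, return the best tool."""
    q = embed(query)
    scores = {name: cosine(q, embed(desc)) for name, desc in TOOLS.items()}
    for ex_query, ex_tool in retrieve_examples(query):
        scores[ex_tool] += cosine(q, embed(ex_query))
    return max(scores, key=scores.get)
```

In the real system, the retrieved examples would be placed into the LLM's prompt rather than summed as scores; this sketch only makes the select-retrieve-decide control flow concrete.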