GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advances achieving pixel-level grounding with reasoning. However, these models, built for common objects, struggle with fine-grained face understanding. In this work, we introduce FacePlayGround-240K, the first large-scale, pixel-grounded face caption and question-answer (QA) dataset, meticulously curated for alignment pretraining and instruction tuning. We present the GroundingFace framework, specifically designed to enhance fine-grained face understanding. This framework significantly augments the capabilities of existing grounding models in face part segmentation and face attribute comprehension while preserving general scene understanding. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in pixel-grounded face captioning/QA and various downstream tasks, including face captioning, referring segmentation, and zero-shot face attribute recognition.
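To make the pixel-grounded QA interface concrete, the sketch below shows one plausible way such a model could be queried: a textual answer paired with per-phrase segmentation masks aligned to the input image. This is a minimal illustration under assumed interfaces; `GroundingFaceModel` and `grounded_answer` are hypothetical stand-ins, not the paper's released API, and the returned values here are dummies.

```python
# Minimal sketch of a pixel-grounded face QA call.
# NOTE: GroundingFaceModel and grounded_answer are hypothetical
# placeholders for illustration; the paper does not publish this API.
import numpy as np
from PIL import Image


class GroundingFaceModel:
    """Stand-in for a pixel-grounding MLLM; returns dummy outputs."""

    def grounded_answer(self, image, question):
        h, w = image.height, image.width
        answer = "The eyes are brown."  # placeholder text answer
        # One binary mask per grounded phrase, same spatial size as the image.
        masks = {"the eyes": np.zeros((h, w), dtype=bool)}
        return answer, masks


model = GroundingFaceModel()
image = Image.new("RGB", (512, 512))  # stand-in for a face photo

answer, masks = model.grounded_answer(
    image, "What color are [the eyes], and where are they?"
)
print(answer)
eye_mask = masks["the eyes"]  # HxW boolean mask aligned to image pixels
print("grounded phrases:", list(masks), "| eye pixels:", int(eye_mask.sum()))
```

The point of the interface is that each bracketed phrase in the question (or answer) is tied to a segmentation mask, which is what distinguishes pixel-grounded QA from ordinary captioning or VQA.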
Related Material
[pdf]
[bibtex]@InProceedings{Han_2025_CVPR, author = {Han, Yue and Zhang, Jiangning and Zhu, Junwei and Hou, Runze and Ji, Xiaozhong and Lin, Chuming and Hu, Xiaobin and Xue, Zhucun and Liu, Yong}, title = {GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model}, booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)}, month = {June}, year = {2025}, pages = {3942-3951} }