@InProceedings{Srivastava_2025_ICCV,
    author    = {Srivastava, Sarthak and Wu, Kathy},
    title     = {HyperVLM: Hyperbolic Space Guided Vision Language Modeling for Hierarchical Multi-Modal Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {2368-2379}
}
HyperVLM: Hyperbolic Space Guided Vision Language Modeling for Hierarchical Multi-Modal Understanding
Abstract
State-of-the-art performance has been achieved in recent years on tasks such as product search, recommendation, and classification using visuo-lingual multimodal models. While pretrained vision-language models like CLIP have shown strong zero-shot capabilities by aligning vision and language in a shared space, they often fail to capture the natural hierarchical relationships common in real-world retail data. In this work, we propose HyperVLM: a vision-language model built on hyperbolic Poincaré geometry that learns joint image-text representations while explicitly modeling their hierarchical structure. We compare HyperVLM with CLIP on zero-shot image classification and retrieval tasks, highlighting its improved performance on tasks involving fine-grained category distinctions, which are critical in large-scale retail environments. We also integrate our method into BLIP's ITC loss module, showing enhanced retrieval accuracy. Our proposed approach holds immense value for recommendation and search systems in retail, where understanding complex product relationships and scalable retrieval are essential.
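For readers unfamiliar with Poincaré geometry, the following is a minimal sketch of the standard geodesic distance on the Poincaré ball, which is what makes hyperbolic space a natural fit for hierarchies: points near the boundary are far from one another even when their Euclidean gap is modest. It is not the authors' implementation, since the abstract gives no formulas or code; the function name `poincare_distance` and the example points are assumptions chosen for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Standard Poincare-ball formula:
        d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Hypothetical toy hierarchy: a root concept at the origin and two
# fine-grained leaf concepts placed near the boundary of the ball.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

root_to_leaf = poincare_distance(root, leaf_a)
leaf_to_leaf = poincare_distance(leaf_a, leaf_b)
```

In this toy setup the two leaves are much farther from each other than either is from the root, mirroring a tree in which siblings are connected only through their parent. How HyperVLM actually places image and text embeddings in the ball is described in the paper itself, not here.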
