@InProceedings{Vidal_2026_CVPR,
    author    = {Vidal, \`Alex Pujol and Escalera, Sergio and Nasrollahi, Kamal and Moeslund, Thomas B.},
    title     = {Fine-tuned Hyperbolic CLIP Models are Good Video Learners},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {7445-7453}
}
Fine-tuned Hyperbolic CLIP Models are Good Video Learners
Abstract
Hyperbolic geometry captures visual-semantic hierarchy effectively for images, yet whether this geometric property extends to video, where temporal dynamics add complexity, remains an open question. To study this, we integrate our method with video-language models and present the first systematic comparison of Euclidean and hyperbolic models for video in zero-shot action recognition settings. We evaluate multiple freezing strategies and temporal aggregation methods across three standard benchmarks. Beyond aggregate accuracy, we introduce a per-class diagnostic analysis that reveals when hyperbolic geometry helps, and we validate these findings geometrically. We observe that hyperbolic features exhibit stronger entailment cone structure, hierarchically organized accuracy gains, and semantically closer misclassifications, confirming that the learned representations capture genuinely hierarchical organization. Our results show that hyperbolic geometry yields consistent improvements across all benchmarks, establishing both a strong baseline and a diagnostic framework for future work on hyperbolic video-language learning.