Hyperbolic Gramian Volumes for Multimodal Alignment

Na, Saiyang; Jiang, Feng; Zhou, Qifeng; Zhong, Wenliang; Dang, Thao M.; Guo, Yuzhi; Ma, Hehuan; Li, Chunyuan; An, Weizhi; Huang, Junzhou

Saiyang Na, Feng Jiang, Qifeng Zhou, Wenliang Zhong, Thao M. Dang, Yuzhi Guo, Hehuan Ma, Chunyuan Li, Weizhi An, Junzhou Huang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 37756-37765

Abstract

Multimodal contrastive learning typically relies on pairwise similarities for alignment, but recent work has shown that Gramian volumes can capture higher-order correlations across modalities. However, Euclidean Gramian volumes suffer from volume collapse under L2 normalization, concentrating near unity with minimal discriminative variance. Hyperbolic geometry's exponential volume growth naturally addresses this via variance preservation, motivating us to extend Gramian alignment to hyperbolic space. Yet preliminary experiments reveal that pure hyperbolic geometry alone is insufficient: while it preserves variance, it underperforms Euclidean baselines on cross-category discrimination. We introduce HyperGRAM, a hybrid geometry framework that combines Euclidean discriminative stability with hyperbolic semantic variance through learnable mixing. Using the numerically stable Lorentz model, HyperGRAM enables volumes to serve dual roles: discriminating matched from mismatched triplets while preserving semantic sensitivity within matched pairs that reflects interpretation spaces (the set of valid multimodal realizations). Evaluation across four video-text benchmarks demonstrates that hybrid geometry consistently outperforms both pure Euclidean and pure hyperbolic variants, achieving significant zero-shot improvements, while semantic sensitivity exhibits contrasting correlation patterns across datasets.
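To make the Euclidean quantity in the abstract concrete, the following is a minimal sketch of a Gramian volume for a triplet of L2-normalized embeddings, computed as the square root of the determinant of the Gram matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `gram_volume`, the 512-dimensional random embeddings, and the fixed seed are all hypothetical choices, and the hyperbolic (Lorentz-model) and hybrid variants described in the abstract are not reproduced here.

```python
import numpy as np


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the L2-normalized rows of `vectors`.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    v = np.asarray(vectors, dtype=float)
    # L2-normalize each row so every embedding lies on the unit sphere.
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    # Gram matrix of pairwise inner products (cosine similarities here).
    gram = v @ v.T
    # Clamp small negative determinants caused by floating-point round-off.
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))


# Mutually orthogonal unit vectors span a unit cube.
orthogonal = gram_volume(np.eye(3))

# Perfectly aligned vectors span a degenerate parallelotope with no volume.
aligned = gram_volume([[1.0, 0.0, 0.0]] * 3)

# Assumption: three random 512-dimensional embeddings stand in for unrelated
# modality features. They are nearly orthogonal, so the volume is close to 1.
rng = np.random.default_rng(0)
random_volume = gram_volume(rng.standard_normal((3, 512)))
```

In this sketch the volume lies between 0 (fully aligned) and 1 (mutually orthogonal). Unrelated high-dimensional unit vectors are nearly orthogonal and so land close to the upper end of that range, which gives one way to picture the concentration near unity that the abstract describes.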

Related Material

@InProceedings{Na_2026_CVPR,
    author    = {Na, Saiyang and Jiang, Feng and Zhou, Qifeng and Zhong, Wenliang and Dang, Thao M. and Guo, Yuzhi and Ma, Hehuan and Li, Chunyuan and An, Weizhi and Huang, Junzhou},
    title     = {Hyperbolic Gramian Volumes for Multimodal Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {37756-37765}
}