OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan

Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He, Zongyuan Ge; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 19838-19849

Abstract

Vision-language pretraining (VLP) enables open-world generalization beyond predefined labels, a critical capability in surgery due to the diversity of procedures, instruments, and patient anatomies. However, applying VLP to ophthalmic surgery presents unique challenges, including limited vision-language data, intricate procedural workflows, and the need for hierarchical understanding, ranging from fine-grained surgical actions to global clinical reasoning. To address these, we introduce OphVL, a large-scale, hierarchically structured dataset containing over 375K video-text pairs, making it 15x larger than existing surgical VLP datasets. OphVL captures a diverse range of ophthalmic surgical attributes, including surgical phases, operations, actions, instruments, medications, disease causes, surgical objectives, and postoperative care recommendations. By aligning short clips with detailed narratives and full-length videos with structured titles, OphVL provides both fine-grained surgical details and high-level procedural context. Building on OphVL, we propose OphCLIP, a hierarchical retrieval-augmented VLP framework. OphCLIP leverages silent surgical videos as a knowledge base, retrieving semantically relevant content to enhance narrated procedure learning. This enables OphCLIP to integrate explicit linguistic supervision with implicit visual knowledge, improving ophthalmic workflow modeling. Evaluations across 11 benchmark datasets for surgical phase recognition and multi-instrument identification demonstrate OphCLIP's robust generalization and superior performance, establishing it as a foundation model for ophthalmic surgery.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Hu_2025_ICCV, author = {Hu, Ming and Yuan, Kun and Shen, Yaling and Tang, Feilong and Xu, Xiaohao and Zhou, Lin and Li, Wei and Chen, Ying and Xu, Zhongxing and Peng, Zelin and Yan, Siyuan and Srivastav, Vinkle and Song, Diping and Li, Tianbin and Shi, Danli and Ye, Jin and Padoy, Nicolas and Navab, Nassir and He, Junjun and Ge, Zongyuan}, title = {OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {19838-19849} }