[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Cheng_2026_WACV,
  author    = {Cheng, Jiajun and Zhao, Xianwu and Liu, Sainan and Yu, Xiaofan and Prakash, Ravi and Codd, Patrick J. and Katz, Jonathan Elliott and Lin, Shan},
  title     = {SurgXBench: Explainable Vision-Language Model Benchmark for Surgery},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {March},
  year      = {2026},
  pages     = {8188-8198}
}
SurgXBench: Explainable Vision-Language Model Benchmark for Surgery
Abstract
Innovations in digital intelligence are transforming robotic surgery by enabling more informed decision-making. Real-time awareness of surgical instrument presence and actions (e.g., cutting tissue) is essential, yet despite decades of research, most machine learning models rely on small datasets and still struggle to generalize. Recently, Vision-Language Models (VLMs) have achieved transformative advances in multimodal reasoning, suggesting strong potential for intelligent robotic surgery. However, surgical VLMs remain underexplored, and existing models show limited performance, underscoring the need for systematic benchmarks to assess their capabilities, limitations, and directions for future development. To this end, we benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification. Beyond standard evaluation, we integrate explainable AI to visualize VLM attention and uncover the causal explanations behind predictions, providing a previously underexplored perspective for assessing model reliability. We also propose explainability-based metrics to complement standard evaluations. Our analysis reveals that surgical VLMs, despite domain-specific training, often rely on weak contextual cues rather than clinically meaningful visual evidence, highlighting the need for stronger visual and reasoning supervision in surgical applications. Our code is publicly available at https://github.com/jiajun344/SurgXBench-Explainable-Vision-Language-Model-Benchmark-for-Surgery
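The abstract does not spell out the proposed explainability-based metrics; as a purely illustrative sketch (not the paper's actual formulation), one common way to score whether a model attends to clinically meaningful evidence is to measure the fraction of attention mass that falls inside a ground-truth instrument segmentation mask. The function name and the toy arrays below are assumptions for illustration only:

```python
import numpy as np

def attention_inside_mask(attn: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of total attention mass falling inside a binary mask.

    attn: non-negative saliency/attention map, shape (H, W)
    mask: binary ground-truth region (e.g., an instrument mask), shape (H, W)
    Returns a score in [0, 1]; higher means the model's attention is
    concentrated on the relevant region rather than background context.
    """
    attn = np.clip(attn, 0.0, None)   # guard against negative saliency values
    total = attn.sum()
    if total == 0:
        return 0.0                    # no attention mass at all
    return float(attn[mask.astype(bool)].sum() / total)

# Toy example: all attention lies on a 2x2 region that matches the mask.
attn = np.zeros((4, 4))
attn[1:3, 1:3] = 1.0
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(attention_inside_mask(attn, mask))  # 1.0
```

A score near 1.0 indicates attention aligned with the instrument region, while a low score flags the weak-contextual-cue behavior the paper reports, even when the classification itself is correct.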
Related Material
