SurgXBench: Explainable Vision-Language Model Benchmark for Surgery

Jiajun Cheng, Xianwu Zhao, Sainan Liu, Xiaofan Yu, Ravi Prakash, Patrick J. Codd, Jonathan Elliott Katz, Shan Lin; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 8188-8198

Abstract


Innovations in digital intelligence are transforming robotic surgery through more informed decision-making. Real-time awareness of surgical instrument presence and actions (e.g., cutting tissue) is essential, yet despite decades of research, most machine learning models rely on small datasets and still struggle to generalize. Recently, Vision-Language Models (VLMs) have achieved transformative advances in multimodal reasoning, suggesting strong potential for intelligent robotic surgery. However, surgical VLMs remain underexplored, and existing models show limited performance, underscoring the need for systematic benchmarks that assess their capabilities and limitations and guide future development. To this end, we benchmark the zero-shot performance of several advanced VLMs on two public robotic-assisted laparoscopic datasets for instrument and action classification. Beyond standard evaluation, we integrate explainable AI to visualize VLM attention and uncover causal explanations behind predictions, providing a previously underexplored perspective for assessing model reliability. We also propose explainability-based metrics to complement standard evaluations. Our analysis reveals that surgical VLMs, despite domain-specific training, often rely on weak contextual cues rather than clinically meaningful visual evidence, highlighting the need for stronger visual and reasoning supervision in surgical applications. The code is provided in our public repository at https://github.com/jiajun344/SurgXBench-Explainable-Vision-Language-Model-Benchmark-for-Surgery

Related Material


@InProceedings{Cheng_2026_WACV,
    author    = {Cheng, Jiajun and Zhao, Xianwu and Liu, Sainan and Yu, Xiaofan and Prakash, Ravi and Codd, Patrick J. and Katz, Jonathan Elliott and Lin, Shan},
    title     = {SurgXBench: Explainable Vision-Language Model Benchmark for Surgery},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {8188-8198}
}