Disentangling Spatio-Temporal Knowledge for Weakly Supervised Object Detection and Segmentation in Surgical Video

Liao, Guiqiu; Jogan, Matjaz; Koushik, Sai; Eaton, Eric; Hashimoto, Daniel A.

Guiqiu Liao, Matjaz Jogan, Sai Koushik, Eric Eaton, Daniel A. Hashimoto; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 8002-8012

Abstract

Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring extensive annotations of object masks, relying instead on coarse video labels indicating object presence. Weakly supervised semantic segmentation of objects in surgical videos is, however, more challenging due to the complex interaction of multiple transient objects, such as surgical tools moving in and out of the surgical field. In this scenario, state-of-the-art WSVOS methods struggle to learn accurate segmentation maps. We address this problem by introducing ViDeo Spatio-Temporal disentanglement Networks (VDST-Net), a framework to disentangle complex spatiotemporal object interactions using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network is designed to help a temporal-reasoning student network resolve activation conflicts as the student leverages temporal dependencies when specifics about object location and timing in the video are not provided. We demonstrate the efficacy of our framework on a challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames, and compare our method to state-of-the-art methods on surgical data and on a public dataset commonly used to benchmark WSVOS. Our method outperforms state-of-the-art techniques and generates accurate segmentation masks under video-level weak supervision. Our code is available at: https://github.com/PCASOlab/VDST-net.

Related Material

[bibtex]

@InProceedings{Liao_2025_WACV,
    author    = {Liao, Guiqiu and Jogan, Matjaz and Koushik, Sai and Eaton, Eric and Hashimoto, Daniel A.},
    title     = {Disentangling Spatio-Temporal Knowledge for Weakly Supervised Object Detection and Segmentation in Surgical Video},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {8002-8012}
}