TAM-VT: Transformation-Aware Multi-Scale Video Transformer for Segmentation and Tracking

Goyal, Raghav; Fan, Wan-Cyuan; Siam, Mennatullah; Sigal, Leonid

Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, Leonid Sigal; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 8325-8334

Abstract

Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings) depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, these data characteristics still present challenges. In this work, we propose a novel clip-based DETR-style encoder-decoder architecture that focuses on systematically analyzing and addressing the aforementioned challenges. Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations - a form of "soft" hard example mining. Further, we propose a multiplicative time-coded memory, beyond vanilla additive positional encoding, which helps propagate context across long videos. Finally, we incorporate these in our proposed holistic multi-scale video transformer for tracking via multi-scale memory matching and decoding to ensure sensitivity and accuracy for long videos and small objects. Our model enables online inference with long videos in a windowed fashion, by breaking the video into clips and propagating context among them. We illustrate that short clip length and longer memory with learned time-coding are important design choices for improved performance. Collectively, these technical contributions enable our model to achieve new state-of-the-art (SoTA) performance on two complex egocentric datasets - VISOR [13] and VOST [44] - while achieving results comparable to SoTA on the conventional VOS benchmark DAVIS'17 [38]. Detailed ablations validate our design choices and provide insights into the importance of parameter choices and their impact on performance.
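
The windowed inference described in the abstract (breaking a video into clips and propagating context among them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `process_clip` callback, and the fixed-size memory queue are all assumptions made for illustration, and the model itself is replaced by a dummy stand-in.

```python
from collections import deque


def split_into_clips(num_frames, clip_len):
    """Return (start, end) frame-index pairs that cover the video in order."""
    return [(s, min(s + clip_len, num_frames))
            for s in range(0, num_frames, clip_len)]


def windowed_inference(num_frames, clip_len, memory_size, process_clip):
    """Process clips one at a time, passing along context from earlier clips.

    `process_clip` is a hypothetical callback standing in for the model. It
    receives (start, end, memory) and returns (result, context).
    """
    # Bounded memory: once full, the oldest clip context is dropped first.
    memory = deque(maxlen=memory_size)
    outputs = []
    for start, end in split_into_clips(num_frames, clip_len):
        result, context = process_clip(start, end, list(memory))
        outputs.append(result)
        memory.append(context)
    return outputs


def dummy_process_clip(start, end, memory):
    """Stand-in for a model: records which earlier clips were visible."""
    return {"frames": (start, end), "seen": memory}, (start, end)


if __name__ == "__main__":
    for out in windowed_inference(10, 4, 2, dummy_process_clip):
        print(out)
```

In this sketch, `clip_len` and `memory_size` play the roles of the clip length and memory length that the abstract identifies as important design choices. The learned time-coding of the memory is not modeled here.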

Related Material

@InProceedings{Goyal_2025_WACV,
    author    = {Goyal, Raghav and Fan, Wan-Cyuan and Siam, Mennatullah and Sigal, Leonid},
    title     = {TAM-VT: Transformation-Aware Multi-Scale Video Transformer for Segmentation and Tracking},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {8325-8334}
}