MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation

Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6323-6333

Abstract


Multiscale video transformers have been explored in a wide variety of vision tasks. To date, however, the multiscale processing has been confined to the encoder or decoder alone. We present a unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in videos. Multiscale representation at both the encoder and decoder yields two key benefits: at encoding, implicit extraction of spatiotemporal features (i.e., without reliance on input optical flow) together with temporal consistency; at decoding, coarse-to-fine detection in which high-level (e.g., object) semantics guide precise localization. Moreover, we propose a transductive learning scheme through many-to-many label propagation to provide temporally consistent predictions. We showcase our Multiscale Encoder-Decoder Video Transformer (MED-VT) on Automatic Video Object Segmentation (AVOS) and actor/action segmentation, where we outperform state-of-the-art approaches on multiple benchmarks using only raw images, without using optical flow.
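The abstract names many-to-many label propagation without giving its formulation. As a rough illustration of the general idea only, and not the authors' method, the sketch below uses a generic affinity-based propagation: tokens from all frames are pooled, each token attends to every other token through a softmax over feature similarity, and the per-class scores are averaged with those weights. The function name `propagate_labels`, the `temperature` parameter, and the toy features and scores are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_labels(features, scores, temperature=0.1):
    """Generic affinity-based label propagation (illustrative assumption).

    features: N feature vectors, one per token, pooled over all frames.
    scores:   N lists of per-class scores, one per token.
    Returns N refined score lists; each is a similarity-weighted average
    of the scores of all tokens in all frames (many-to-many).
    """
    num_classes = len(scores[0])
    refined = []
    for f_i in features:
        # Dot-product similarity of this token to every token in every frame.
        sims = [sum(a * b for a, b in zip(f_i, f_j)) / temperature
                for f_j in features]
        weights = softmax(sims)
        refined.append([sum(w * s[c] for w, s in zip(weights, scores))
                        for c in range(num_classes)])
    return refined


# Toy clip: two frames, two tokens each; scores are [foreground, background].
features = [[1.0, 0.0], [0.0, 1.0],   # frame 1: object token, background token
            [1.0, 0.0], [0.0, 1.0]]   # frame 2: same appearance
scores = [[0.9, 0.1], [0.1, 0.9],     # frame 1: confident predictions
          [0.4, 0.6], [0.2, 0.8]]     # frame 2: object token is mislabeled
refined = propagate_labels(features, scores)
```

In this toy case the object token in frame 2 starts with a foreground score of 0.4 and, after borrowing evidence from the matching token in frame 1, ends up above 0.5, which is the kind of temporal consistency the abstract refers to.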

Related Material


@InProceedings{Karim_2023_CVPR,
    author    = {Karim, Rezaul and Zhao, He and Wildes, Richard P. and Siam, Mennatullah},
    title     = {MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {6323-6333}
}