Video OWL-ViT: Temporally-consistent Open-world Localization in Video

Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 13802-13811


We present an architecture and a training recipe that adapt pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pretraining on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pretrained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pretraining, can be transferred successfully to open-world localization across diverse videos.
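The recurrence described in the abstract, where the decoder's output tokens for one frame become the object queries for the next, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`cross_attend`, `propagate_queries`), the tensor sizes, and the random stand-in for per-frame image tokens are all hypothetical, and a single weight-free cross-attention step with a residual connection stands in for the learned transformer decoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frame_tokens):
    """Toy stand-in for a decoder layer: each query attends over the
    frame's tokens (scaled dot product) and adds the result residually.
    No learned weights are used; this is an assumption for illustration."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, t)) / math.sqrt(dim)
            for t in frame_tokens
        ]
        weights = softmax(scores)
        attended = [
            sum(w * t[d] for w, t in zip(weights, frame_tokens))
            for d in range(dim)
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def propagate_queries(video_tokens, initial_queries):
    """Run the decoder frame by frame, feeding each frame's output
    tokens back in as the object queries for the next frame."""
    queries = initial_queries
    per_frame_outputs = []
    for frame_tokens in video_tokens:
        outputs = cross_attend(queries, frame_tokens)
        per_frame_outputs.append(outputs)
        queries = outputs  # query slot i keeps following the same object
    return per_frame_outputs


# Hypothetical sizes; random vectors stand in for per-frame image tokens.
random.seed(0)
num_frames, num_tokens, num_queries, dim = 4, 6, 3, 8
video_tokens = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_frames)
]
initial_queries = [
    [random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)
]
tracks = propagate_queries(video_tokens, initial_queries)
```

Because query slot `i` at frame `t` is derived from slot `i` at frame `t-1`, `tracks[t][i]` for a fixed `i` can be read as one object's representation over time. In a tracking-by-detection baseline, by contrast, each frame would be processed independently and the per-frame detections linked afterwards.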

Related Material

@InProceedings{Heigold_2023_ICCV,
    author    = {Heigold, Georg and Minderer, Matthias and Gritsenko, Alexey and Bewley, Alex and Keysers, Daniel and Lu\v{c}i\'c, Mario and Yu, Fisher and Kipf, Thomas},
    title     = {Video OWL-ViT: Temporally-consistent Open-world Localization in Video},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13802-13811}
}