Is an Object-Centric Video Representation Beneficial for Transfer?

Chuhan Zhang, Ankush Gupta, Andrew Zisserman; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 1976-1994


The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets (SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens), we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware) when: (1) classifying actions on unseen objects and in unseen environments; (2) low-shot learning of novel classes; (3) linear probing on other downstream tasks; as well as (4) standard action classification.
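The abstract does not specify the form of the trajectory contrast loss, so the following is only an illustrative sketch of a generic InfoNCE-style contrastive objective, not the loss used in the paper. It assumes each object's summary vector is paired with an embedding of that object's trajectory, and that matched pairs should score higher than mismatched ones. The function name `info_nce`, the `temperature` value and the toy vectors are hypothetical.

```python
import math


def info_nce(queries, keys, temperature=0.1):
    """Generic InfoNCE loss (assumed form, not the paper's definition).

    queries[i] is treated as the positive match for keys[i]; every other
    key in the batch acts as a negative.
    """
    def normalize(v):
        # L2-normalise so that dot products are cosine similarities
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]

    total = 0.0
    for i, qi in enumerate(q):
        # similarity of query i to every key, scaled by the temperature
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # numerically stable log-sum-exp over all keys
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # cross-entropy with the matching key as the target
        total += log_denom - logits[i]
    return total / len(q)


# Toy example: two hypothetical summary vectors and two trajectory embeddings
summary_vectors = [[1.0, 0.0], [0.0, 1.0]]
trajectory_embeddings = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(summary_vectors, trajectory_embeddings)
loss_shuffled = info_nce(summary_vectors, trajectory_embeddings[::-1])
```

In this toy setting the loss is small when each summary vector lines up with its own trajectory embedding and large when the pairing is swapped, which is the behaviour a contrastive objective of this kind is meant to encourage.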

Related Material

@InProceedings{Zhang_2022_ACCV,
    author    = {Zhang, Chuhan and Gupta, Ankush and Zisserman, Andrew},
    title     = {Is an Object-Centric Video Representation Beneficial for Transfer?},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {1976-1994}
}