@InProceedings{Mazzini_2025_ICCV,
    author    = {Mazzini, Davide and Raimondi, Alberto and Abbate, Bruno and Fischetti, Daniel and Woollard, David},
    title     = {RetailAction: Dataset for Multi-View Spatio-Temporal Localization of Human-Object Interactions in Retail},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {2400-2408}
}
RetailAction: Dataset for Multi-View Spatio-Temporal Localization of Human-Object Interactions in Retail
Abstract
We introduce RetailAction, a novel dataset designed for multi-view spatio-temporal localization of human-object interactions in retail stores. Existing datasets either provide only video-level action classification (without spatio-temporal localization) or, when such annotations are present, are limited in scale and not specific to the retail sector, often lacking real-world store data. RetailAction addresses these limitations by focusing on interactions between actual customers and store products, captured from multiple top-view cameras in 10 different real-world convenience stores. The dataset consists of 21,000 samples, each containing two synchronized videos, for a total duration of 41 hours. In addition to the videos, the dataset includes annotations detailing precise interaction points for both views, temporal ranges, and action categories for each interaction. In this paper, we describe the data collection process and provide an analysis of the dataset's statistics. We also present a baseline model for spatio-temporal localization of interaction points and compare different state-of-the-art backbones. Finally, we present a novel set of evaluation metrics tailored to this use case. RetailAction aims to facilitate research on fine-grained action recognition and localization, offering a valuable resource for developing advanced retail analytics applications.
