YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video

Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, Vincent Vanhoucke; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5296-5305

Abstract


We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. The data set can be found at https://research.google.com/youtube-bb. We hope the availability of such large curated corpus will spur new advances in video object detection and tracking.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Real_2017_CVPR,
author = {Real, Esteban and Shlens, Jonathon and Mazzocchi, Stefano and Pan, Xin and Vanhoucke, Vincent},
title = {YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {July},
year = {2017}
}