Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans

Ainaz Eftekhar, Alexander Sax, Jitendra Malik, Amir Zamir; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10786-10796


Computer vision now relies on data, but we know surprisingly little about what factors in the data affect performance. We argue that this stems from the way data is collected. Designing and collecting static datasets of images (or videos) locks us into specific design choices and limits us to post-hoc analyses. In practice, vision datasets only include specific domains and tasks. This not only makes it necessary and difficult to combine datasets, but leads to scattershot overall coverage that frustrates systematic research into the interaction of tasks, data, models, and learning algorithms. For example, if a model trained for classification on ImageNet transfers better to COCO than does a model trained for depth estimation on KITTI, is that due to the difference in tasks or the different training data? We note that one way to disentangle these factors is to use a comprehensive, standardized scene representation that contains extra information about the scene, and then to use that representation to create a specific dataset for study. We introduce a platform for doing this. Specifically, we provide a pipeline that takes 3D scans as input and generates multi-task datasets of mid-level cues. The pipeline exposes complete control over the generation process and is implemented mostly in Python, and we provide ecosystem tools such as a Docker image and PyTorch dataloaders. We also provide a starter dataset of several recent 3D scan datasets, processed into standard static datasets of mid-level cues. We show that this starter dataset (generated from the annotator pipeline) is reliable: it yields models that provide state-of-the-art performance for several tasks, including human-level surface normal estimation performance on OASIS, despite having never seen OASIS data during training. With the proliferation of cheaper 3D sensors (e.g. on the newest iPhone), we anticipate that releasing an automated tool for this processing pipeline will allow the starter set to continue to expand and cover more domains. We examine a few small examples of using this procedure to analyze the relationship of data, tasks, models, and learning algorithms, and suggest several exciting directions that are well beyond the scope of this paper.

Related Material

@InProceedings{Eftekhar_2021_ICCV,
    author    = {Eftekhar, Ainaz and Sax, Alexander and Malik, Jitendra and Zamir, Amir},
    title     = {Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {10786-10796}
}