MIMIC: Masked Image Modeling with Image Correspondences

Marathe, Kalyani; Bigverdi, Mahtab; Khan, Nishat; Kundu, Tuhin; Howe, Patrick; S, Sharan Ranjit; Bhattad, Anand; Kembhavi, Aniruddha; Shapiro, Linda G.; Krishna, Ranjay

Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 718-727

Abstract

Dense pixel-specific representation learning at scale has been bottlenecked by the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets rely heavily on annotated 3D meshes, point clouds, and camera parameters from simulated environments, which prevents them from building datasets from real-world data sources where such metadata is lacking. We introduce a pretraining dataset-curation approach that does not require any additional annotations. Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale. Specifically, we experiment with two scales, MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs, and train models with different masked image modeling objectives. Through our comprehensive experimental analysis, we show that representations trained on our automatically generated MIMIC-3M outperform those learned from expensive crowdsourced datasets (ImageNet-1K) and those learned from synthetic environments (MULTIVIEW-HABITAT) on three dense geometric tasks: depth estimation on NYUv2 (1.7%), surface normal estimation on Taskonomy (2.05%), and depth estimation on Taskonomy (7.5%). They also perform on par with MULTIVIEW-HABITAT on the Taskonomy edges and curvature tasks. The larger dataset (MIMIC-3M) improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. We hope that our work will facilitate a rigorous investigation of the properties of real and synthetic data sources for large-scale dense representation learning.

Related Material

@InProceedings{Marathe_2024_CVPR,
    author    = {Marathe, Kalyani and Bigverdi, Mahtab and Khan, Nishat and Kundu, Tuhin and Howe, Patrick and S, Sharan Ranjit and Bhattad, Anand and Kembhavi, Aniruddha and Shapiro, Linda G. and Krishna, Ranjay},
    title     = {MIMIC: Masked Image Modeling with Image Correspondences},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {718-727}
}