BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Bhat, Vineet; Kim, Sungsu; Blukis, Valts; Heinrich, Greg; Krishnamurthy, Prashanth; Karri, Ramesh; Birchfield, Stan; Khorrami, Farshad; Tremblay, Jonathan

Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 16746-16757

Abstract

Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ("left of," "behind", etc.) but ignore fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. Project website: https://bop-ask.github.io/

Related Material

@InProceedings{Bhat_2026_CVPR,
    author    = {Bhat, Vineet and Kim, Sungsu and Blukis, Valts and Heinrich, Greg and Krishnamurthy, Prashanth and Karri, Ramesh and Birchfield, Stan and Khorrami, Farshad and Tremblay, Jonathan},
    title     = {BOP-ASK: Object-Interaction Reasoning for Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {16746-16757}
}