V2A - Vision to Action: Learning robotic arm actions based on vision and language

Michal Nazarczuk, Krystian Mikolajczyk; Proceedings of the Asian Conference on Computer Vision (ACCV), 2020

Abstract


In this work, we present a new AI task - Vision to Action (V2A) - in which an agent (a robotic arm) is asked to perform a high-level task (e.g. stacking) with objects present in a scene. The agent has to suggest a plan consisting of primitive actions (e.g. simple movements, grasping) in order to successfully complete the given task. Instructions are formulated in a way that forces the agent to perform visual reasoning over the presented scene before inferring the actions. We extend the recently introduced SHOP-VRB dataset with task instructions for each scene, as well as an engine capable of assessing whether a sequence of primitives leads to successful task completion. We also propose a novel approach based on multimodal attention for this task and demonstrate its performance on the new dataset.
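To make the task setup above concrete, below is a minimal sketch of how a V2A sample, a predicted primitive plan, and the success-checking engine could be organized. This is an assumption-laden illustration only: the names Primitive, V2ASample, SimulationEngine, and evaluate are hypothetical and do not correspond to the authors' actual primitive set or engine interface.

```python
# Minimal sketch of the V2A evaluation loop described in the abstract.
# All names (Primitive, V2ASample, SimulationEngine, evaluate) are
# hypothetical illustrations; the actual SHOP-VRB extension defines its
# own primitives and engine interface.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Primitive(Enum):
    """Low-level actions an agent can emit (illustrative set only)."""
    MOVE_TO = auto()
    GRASP = auto()
    RELEASE = auto()


@dataclass
class V2ASample:
    scene_id: str      # identifier of a rendered scene
    instruction: str   # high-level task, e.g. "stack the mugs"


class SimulationEngine:
    """Stand-in for the engine that judges whether a primitive
    sequence completes the instructed task (assumed interface)."""

    def run(self, sample: V2ASample, plan: List[Primitive]) -> bool:
        # A real engine would execute the plan in the scene and check
        # the goal condition; here we only illustrate the call pattern.
        raise NotImplementedError


def evaluate(agent, samples: List[V2ASample],
             engine: SimulationEngine) -> float:
    """Fraction of scenes where the agent's predicted plan succeeds."""
    successes = 0
    for sample in samples:
        plan = agent.predict(sample)           # sequence of primitives
        successes += engine.run(sample, plan)  # True/False outcome
    return successes / max(len(samples), 1)
```

Any agent exposing a predict(sample) method returning a list of primitives could be plugged into this loop; the proposed multimodal-attention model would play that role in the paper's experiments.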

Related Material


@InProceedings{Nazarczuk_2020_ACCV,
    author    = {Nazarczuk, Michal and Mikolajczyk, Krystian},
    title     = {V2A - Vision to Action: Learning robotic arm actions based on vision and language},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {November},
    year      = {2020}
}