SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning

CVPR Anonymous Submission

Abstract

Constructing a compact yet informative representation of 3D scenes is essential for effective embodied exploration and reasoning, especially in complex environments and over long time horizons. Existing scene representations, such as object-centric 3D scene graphs, have significant limitations: they oversimplify spatial relationships by modeling scenes as collections of individual objects whose inter-object relationships are described by restrictive text labels, making it difficult to answer queries that require nuanced spatial understanding. Furthermore, these representations lack natural mechanisms for active exploration and memory management, which hampers their application to lifelong autonomy. In this work, we propose SnapMem, a novel snapshot-based scene representation that serves as a 3D scene memory for embodied agents. SnapMem employs informative images, termed Memory Snapshots, to capture rich visual information about explored regions. It also integrates frontier-based exploration through Frontier Snapshots (glimpses of unexplored areas), which enable agents to make informed exploration decisions by weighing both known information and the potential gain from exploring further. To support lifelong memory during active exploration, we further present an incremental construction pipeline for SnapMem, together with an effective memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that SnapMem significantly enhances agents' exploration and reasoning capabilities in 3D environments over extended periods, highlighting its potential for advancing applications in embodied AI.
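To make the representation concrete, below is a minimal Python sketch of how a snapshot-based scene memory along these lines could be organized. The class and field names (MemorySnapshot, FrontierSnapshot, SnapMem) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of a snapshot-based scene memory.
# All names here are assumptions, not the authors' implementation.
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class MemorySnapshot:
    """An informative image of an explored region, plus light metadata."""
    image: np.ndarray                           # RGB observation, H x W x 3
    pose: Tuple[float, float, float]            # agent position at capture time
    visible_objects: List[str] = field(default_factory=list)


@dataclass
class FrontierSnapshot:
    """A glimpse toward an unexplored area, used for exploration decisions."""
    image: np.ndarray
    frontier_center: Tuple[float, float, float]  # target point to explore


@dataclass
class SnapMem:
    """Scene memory: explored regions as memory snapshots, unexplored areas as frontiers."""
    memory: List[MemorySnapshot] = field(default_factory=list)
    frontiers: List[FrontierSnapshot] = field(default_factory=list)

    def add_observation(self, snap: MemorySnapshot) -> None:
        """Incrementally grow the memory as the agent explores."""
        self.memory.append(snap)

    def update_frontiers(self, frontiers: List[FrontierSnapshot]) -> None:
        """Replace the frontier set after the explored area changes."""
        self.frontiers = frontiers
```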

Demo

We present demos of embodied question answering in continuous exploration scenarios. The agent is initialized at a specific location in an unknown environment and tasked with answering a series of questions by exploring the environment and gathering the necessary information. At each step, the agent either selects a frontier to explore or uses the dynamically constructed scene memory to answer the current question.
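This decision loop can be sketched as follows, reusing the SnapMem sketch above. The helper names (agent.choose, env.navigate_to, agent.answer, env.frontiers) are hypothetical placeholders, since the actual interfaces are not specified here.

```python
# Hedged sketch of the explore-or-answer loop (helper names are hypothetical).
def run_episode(agent, env, memory, questions, max_steps=50):
    """One episode: answer several questions in sequence while exploring."""
    answers = []
    for question in questions:
        for _ in range(max_steps):
            # The agent inspects its memory and frontier snapshots and picks
            # either a frontier (explore) or a memory snapshot (answer).
            choice = agent.choose(question, memory.memory, memory.frontiers)
            if choice.kind == "frontier":
                observation = env.navigate_to(choice.target)  # move toward the frontier
                memory.add_observation(observation)           # grow the scene memory
                memory.update_frontiers(env.frontiers())      # recompute open frontiers
            else:
                answers.append(agent.answer(question, choice))
                break
        else:  # exploration budget exhausted: answer from memory as-is
            answers.append(agent.answer(question, None))
    return answers
```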

Each demo video below represents an exploration episode in a specific scene within Habitat-sim, where the agent must answer 6-8 questions in sequence. In each frame, the current question is displayed at the top, followed by four illustrations:

  • The first is a top-down map of the scene. Its notation is explained in the accompanying legend:
    [Figure: Legend for Top-down Map]
  • The second illustration shows the agent's egocentric view from its current position.
  • The third illustration shows the agent's decision at each step, highlighted in red. The agent chooses either a frontier or a memory snapshot, and the reason for the choice is displayed at the bottom of the frame.
  • The fourth illustration shows all current frontier snapshots at the top and the memory snapshots that remain after prefiltering at the bottom; the highlighted snapshot is the agent's choice at the current step. A sketch of this prefiltering step is given after this list.
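The prefiltering mentioned in the last item is, in spirit, a relevance-based retrieval over the memory snapshots. The page does not specify the mechanism, so the sketch below assumes precomputed question and image embeddings (e.g., from a CLIP-style encoder) and keeps the top-k snapshots most similar to the question; the function name and signature are illustrative.

```python
# Illustrative prefiltering by question-snapshot similarity. The actual
# retrieval technique is not specified here; embeddings are assumed inputs.
import numpy as np


def prefilter_snapshots(question_emb: np.ndarray,
                        snapshot_embs: np.ndarray,
                        top_k: int = 10) -> np.ndarray:
    """Return indices of the top_k snapshots most similar to the question.

    question_emb:  (D,) embedding of the question text
    snapshot_embs: (N, D) embeddings of the memory snapshot images
    """
    q = question_emb / np.linalg.norm(question_emb)
    s = snapshot_embs / np.linalg.norm(snapshot_embs, axis=1, keepdims=True)
    sims = s @ q                        # cosine similarities, shape (N,)
    return np.argsort(-sims)[:top_k]    # indices of the most relevant snapshots
```

Only the snapshots that survive this filter are shown in the fourth illustration and passed to the agent's reasoning step, which keeps the memory it must consider compact even as exploration continues.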