GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression

For multi-view video input, our method splits the scene into a static background and a dynamic human. The background uses a consistent 3D Gaussian field; human appearance and motion are encoded by a few key views and SMPL parameters. At the decoder, these generate a Gaussian avatar driven by SMPL, which is fused with the background for high-fidelity reconstruction.

Abstract

Human-centric multi-view video has a clear semantic structure: a static background and a dynamically moving human. We propose a generative compression framework that explicitly decouples these two components. The background is modeled once with 3D Gaussian Splatting. The human is represented by a personalized Gaussian avatar, which is reconstructed from a sparse set of key views transmitted only once and is driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model.

The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling.
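To make the three-element stream concrete, the following is a minimal back-of-envelope sketch of how the one-time costs amortize over sequence length. It is an illustration under stated assumptions, not the accounting used in GauMVC: the class `StreamSizes`, the byte figures in the usage note, the uncompressed float32 storage, the per-frame root translation, and sending the shape coefficients only once are all hypothetical. The only fixed facts are the standard SMPL dimensions (72 pose values, 10 shape coefficients).

```python
from dataclasses import dataclass

# Standard SMPL dimensions: 24 joints x 3 axis-angle values, 10 shape coefficients.
SMPL_POSE_DIMS = 72
SMPL_SHAPE_DIMS = 10
# Assumption: one global root translation is also sent per frame.
TRANSLATION_DIMS = 3


@dataclass
class StreamSizes:
    """Hypothetical byte budget for the three transmitted elements."""
    background_bytes: int     # background Gaussians, sent once
    key_view_bytes: int       # key views, sent once
    num_frames: int           # length of the sequence
    bytes_per_value: int = 4  # assumption: raw float32, no entropy coding

    def smpl_bytes(self) -> int:
        # Assumption: shape is sent once, pose and translation every frame.
        per_frame = (SMPL_POSE_DIMS + TRANSLATION_DIMS) * self.bytes_per_value
        shape_once = SMPL_SHAPE_DIMS * self.bytes_per_value
        return shape_once + per_frame * self.num_frames

    def total_bytes(self) -> int:
        return self.background_bytes + self.key_view_bytes + self.smpl_bytes()

    def bytes_per_frame(self) -> float:
        # One-time costs are spread over all frames of the sequence.
        return self.total_bytes() / self.num_frames
```

With made-up sizes such as `StreamSizes(5_000_000, 1_000_000, 100)` and the same budget over 1000 frames, `bytes_per_frame()` drops as the sequence grows, because only the SMPL term scales with frame count. Note that this cost is also independent of the number of captured views.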

Experiments on multiple human-centric datasets demonstrate superior rate–distortion performance, particularly for long and densely captured sequences. The decoupled representation also naturally enables semantic editing.

Qualitative comparison on AvatarRex

Framework

The detailed architecture of our proposed GauMVC. (A) Static branch pipeline. Static regions are extracted from multi-view sequences, used to initialize and optimize a background Gaussian model under occlusion guidance, and finally compressed into a compact Gaussian representation. (B) Dynamic branch pipeline. The dynamic branch models human motion and appearance. It first extracts and compresses SMPL parameters for pose and shape, then generates a personalized Gaussian avatar from key views through region-wise fusion, enabling compact and high-fidelity free-viewpoint video synthesis.
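As a rough illustration of what "driven by SMPL" and "fused with the background" can mean on the decoder side, the sketch below poses avatar Gaussian centers with generic linear blend skinning (the mechanism SMPL uses for mesh vertices) and then concatenates them with the background set. This is an assumption-laden sketch, not the GauMVC decoder: the function names, the per-Gaussian skinning weights, and the reduction of each Gaussian to its center are hypothetical. A real Gaussian also carries covariance, opacity, and color, and its orientation would need to be transformed as well.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, ...]
# A rigid joint transform as (3x3 rotation rows, translation vector).
Transform = Tuple[Sequence[Sequence[float]], Sequence[float]]


def apply_transform(t: Transform, p: Sequence[float]) -> Vec3:
    """Rotate and translate a 3D point by one rigid transform."""
    rot, trans = t
    return tuple(
        sum(rot[i][j] * p[j] for j in range(3)) + trans[i] for i in range(3)
    )


def skin_center(
    center: Sequence[float],
    weights: Sequence[float],
    joints: Sequence[Transform],
) -> Vec3:
    """Linear blend skinning: weighted sum of the point as moved by each joint."""
    moved = [apply_transform(t, center) for t in joints]
    return tuple(sum(w * m[i] for w, m in zip(weights, moved)) for i in range(3))


def fuse(
    background: List[Vec3],
    avatar_canonical: List[Vec3],
    skin_weights: List[Sequence[float]],
    joints: Sequence[Transform],
) -> List[Vec3]:
    """Pose the avatar centers for one frame, then append them to the background."""
    posed = [
        skin_center(c, w, joints) for c, w in zip(avatar_canonical, skin_weights)
    ]
    return background + posed
```

In this sketch the background list is reused unchanged for every frame, while only `joints` changes per frame, which mirrors the static/dynamic split described in the caption.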

Comparison with baseline on Actor05 of ENeRF-Outdoor
