Human-centric multi-view video has a clear semantic structure: a static background and dynamic human motion. We propose a generative compression framework that explicitly decouples these components. The background is modeled once with 3D Gaussian Splatting, while the human is represented by a personalized Gaussian avatar. The avatar is reconstructed from a sparse set of key views, which are transmitted only once, and is driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model.
The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling.
Experiments on multiple human-centric datasets demonstrate superior rate–distortion performance, particularly for long and densely captured sequences. The decoupled representation also naturally enables semantic editing.
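The advantage on long and densely captured sequences follows from simple amortization: the background and key views are paid for once, and only the SMPL parameters grow with the number of frames. The sketch below illustrates this accounting. The class name `StreamBudget`, its fields, and all byte counts are hypothetical placeholders chosen for illustration, not measurements from our experiments, and the sketch assumes that one transmitted payload serves every synthesized view.

```python
from dataclasses import dataclass


@dataclass
class StreamBudget:
    """Hypothetical byte budget for one encoded sequence (sizes are illustrative)."""
    background_bytes: int      # compressed background Gaussians, sent once
    key_view_bytes: int        # sparse key views for the avatar, sent once
    smpl_bytes_per_frame: int  # compressed SMPL parameters, sent every frame

    def total_bytes(self, num_frames: int) -> int:
        """One-time cost plus the per-frame cost over the whole sequence."""
        one_time = self.background_bytes + self.key_view_bytes
        return one_time + self.smpl_bytes_per_frame * num_frames

    def bits_per_frame_per_view(self, num_frames: int, num_views: int) -> float:
        """Average bits spent per rendered frame of a single view."""
        return 8 * self.total_bytes(num_frames) / (num_frames * num_views)


# Assumed numbers: 2 MB background, 0.5 MB key views, 200 B of SMPL data per frame.
budget = StreamBudget(background_bytes=2_000_000,
                      key_view_bytes=500_000,
                      smpl_bytes_per_frame=200)

for frames in (100, 1_000, 10_000):
    print(frames, budget.bits_per_frame_per_view(frames, num_views=20))
```

Under these assumed sizes, the per-frame, per-view cost falls as either the sequence length or the number of views increases, because the one-time terms are shared across more rendered frames.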
Detailed architecture of the proposed GauMVC framework. (A) Static branch pipeline. Static regions are extracted from the multi-view sequences, used to initialize and optimize a background Gaussian model under occlusion guidance, and finally compressed into a compact Gaussian representation. (B) Dynamic branch pipeline. The dynamic branch models human motion and appearance. It first extracts and compresses the SMPL pose and shape parameters, then generates a personalized Gaussian avatar from the key views through region-wise fusion, enabling compact, high-fidelity free-viewpoint video synthesis.
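To give a sense of how small the per-frame payload can be, the following sketch quantizes a single SMPL pose vector. It is only an illustration: the 16-bit uniform quantizer, the clamp range of ±π radians, and the function names `encode_pose` and `decode_pose` are assumptions made here, not the compression scheme used in GauMVC. The only fact taken as given is the standard SMPL pose layout of 24 joints with 3 axis-angle components each.

```python
import math
import struct

POSE_DIM = 72          # standard SMPL pose: 24 joints x 3 axis-angle components
LIMIT = math.pi        # assumed clamp range per component, in radians
LEVELS = 2 ** 16 - 1   # 16-bit uniform quantizer (illustrative choice)


def encode_pose(pose):
    """Quantize 72 axis-angle values to 16 bits each and pack them into bytes."""
    if len(pose) != POSE_DIM:
        raise ValueError(f"expected {POSE_DIM} values, got {len(pose)}")
    codes = []
    for value in pose:
        clamped = max(-LIMIT, min(LIMIT, value))
        # Map [-LIMIT, LIMIT] onto the integer range [0, LEVELS].
        codes.append(round((clamped + LIMIT) / (2 * LIMIT) * LEVELS))
    return struct.pack(f"<{POSE_DIM}H", *codes)


def decode_pose(payload):
    """Invert encode_pose up to the quantization error."""
    codes = struct.unpack(f"<{POSE_DIM}H", payload)
    return [code / LEVELS * 2 * LIMIT - LIMIT for code in codes]


# Synthetic pose values used purely as test input.
pose = [math.sin(i) for i in range(POSE_DIM)]
payload = encode_pose(pose)
restored = decode_pose(payload)
max_error = max(abs(a - b) for a, b in zip(pose, restored))
print(len(payload), max_error)
```

With these assumed settings, each frame's pose occupies 144 bytes before any entropy coding, and the reconstruction error per component is bounded by half a quantization step.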