IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos.

Supplementary material.



Comparison with state-of-the-art approaches.

Self-reenactment

We compare our method with existing one-shot photorealistic talking-head approaches trained on monocular videos in the wild: Face-V2V, EMOPortrait, Portrait4D-v2, X-Portrait, and Follow-Your-Emoji. Face-V2V gives stable results and preserves the input identity, but produces less detail due to its low output resolution. EMOPortrait gives sharper results but suffers from larger identity shifts. Portrait4D-v2 produces good sharpness and smaller identity shifts, but relatively stiff expressions. The diffusion-based methods Follow-Your-Emoji and X-Portrait render good high-frequency details, but do not precisely follow the input head poses or expressions. In contrast, our method achieves good image quality and faithful identity while closely following the input control signals.





Side view visualization

Here we show side-view renderings. We generate all the videos in the same setting as self-reenactment, where the first frame is used as the reference portrait and the remaining frames serve as driving signals. Since we use an MPI as our scene representation, the generated talking-head videos do not support rendering under large camera viewpoint changes. However, our method can still generate reasonable stereo videos for viewing purposes.
We provide stereo videos in the folder stereo_video and tested all of them with Google Cardboard. Specifically, we tested the stereo videos on a 6.1-inch iPhone 15 Pro in the album's preview mode, slideshow, and fullscreen playback, and on a 6.7-inch Google Pixel 7 Pro using fullscreen playback.
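
As a rough illustration of how such stereo pairs can be obtained from an MPI, below is a minimal sketch, assuming the MPI is stored as a far-to-near list of fronto-parallel RGBA planes at known depths. The function names, the pure horizontal-baseline camera model, and the per-plane disparity-shift approximation are illustrative assumptions, not the exact renderer used in the paper.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def composite_mpi(planes):
    """Back-to-front 'over' compositing of fronto-parallel RGBA planes.
    `planes`: list of (H, W, 4) float arrays ordered far -> near."""
    rgb = np.zeros_like(planes[0][..., :3])
    for plane in planes:                         # far to near
        alpha = plane[..., 3:4]
        rgb = plane[..., :3] * alpha + rgb * (1.0 - alpha)
    return rgb

def render_stereo_pair(planes, depths, focal, baseline):
    """Render left/right eye views by shifting each plane by its disparity.
    Exact only for a pure horizontal camera translation of fronto-parallel planes."""
    def render_eye(tx):                          # tx: camera offset along x (same units as depths)
        warped = []
        for plane, depth in zip(planes, depths):
            disparity = focal * tx / depth       # pixels; nearer planes move more
            warped.append(nd_shift(plane, (0.0, -disparity, 0.0),
                                   order=1, mode='constant'))
        return composite_mpi(warped)
    return render_eye(-baseline / 2.0), render_eye(+baseline / 2.0)
```

The two rendered views can then be placed side by side to form each frame of a stereo video for Cardboard-style viewing.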







Generated side view video

We render each generated MPI frame from viewing angles of -5°, 0°, and +5°.
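
For reference, the following sketch shows one standard way such off-axis views can be rendered from an MPI: each fronto-parallel plane is warped into the rotated camera with its plane-induced homography, and the results are alpha-composited back to front. The orbit-around-a-pivot camera model, the pivot depth, and all function names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np
import cv2

def rotation_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def render_mpi_at_yaw(planes, depths, K, yaw_deg, pivot_depth):
    """Render an MPI (far->near list of (H, W, 4) float32 RGBA planes at `depths`)
    from a camera orbited by `yaw_deg` around a vertical axis through the point
    at `pivot_depth` in front of the reference camera (intrinsics `K`)."""
    h, w = planes[0].shape[:2]
    theta = np.deg2rad(yaw_deg)
    R_orbit = rotation_y(theta)
    pivot = np.array([0.0, 0.0, pivot_depth])
    cam_center = pivot + R_orbit @ (-pivot)       # new camera center in reference coords
    R = R_orbit.T                                 # reference-to-new-camera rotation
    t = -R @ cam_center
    n = np.array([0.0, 0.0, 1.0])                 # normal of the fronto-parallel planes
    K_inv = np.linalg.inv(K)
    out = np.zeros((h, w, 3), dtype=np.float32)
    for plane, depth in zip(planes, depths):      # far to near
        # Homography induced by the plane z = depth, mapping reference-view
        # pixels to pixels in the rotated view.
        H = K @ (R + np.outer(t, n) / depth) @ K_inv
        warped = cv2.warpPerspective(plane, H, (w, h))
        alpha = warped[..., 3:4]
        out = warped[..., :3] * alpha + out * (1.0 - alpha)
    return out
```

Under these assumptions, calling the function at yaw angles of -5, 0, and +5 (with the pivot depth set roughly to the head's distance) would produce the three viewpoints shown in the side-view videos.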







Evaluation on synthetic data

Despite being trained on real-world talking-head videos, our model still generalizes to stylized portraits generated by Stable Diffusion 3.








Long video generation

We show long videos generated by our model. Below is the input reference portrait. In each video, the driving sequence is on the left and the generated video is on the right.