Archon

A Unified Multimodal Model for Holistic Digital Human Generation

Archon extends multimodal language models to holistic digital human generation.

We show results for any-to-any generation and any-modality editing across description (text), script (text), speech (audio), animation, image, semantic segmentation, and video.



Any-to-Any Generation

Description → Script + Speech + Animation + Segmentation + Video

We demonstrate the generation of script, speech, animation, segmentation, and video driven solely by descriptions. The input description is displayed above each example, and the generated script is shown below each video. The video composite presents the generated animation, segmentation, and final video arranged from left to right.




Description + Script → Speech + Animation + Segmentation + Video

We showcase results where descriptions and scripts are employed to generate speech, animation, segmentation, and video. The corresponding input description and script are provided above each example. The video composite displays the generated animation, segmentation, and final video arranged from left to right.




Speech → Description + Script + Animation + Segmentation + Video

We demonstrate results where speech is used to generate description, script, animation, segmentation, and video. The inferred description and script are displayed below each example. The video composite visualizes the generated animation, segmentation, and final video arranged from left to right.




Animation → Segmentation + Video

We present results where animation serves as the condition to generate segmentation and video. The composite video displays the input animation, generated segmentation, and final video arranged from left to right.




Segmentation → Video

We present results where segmentation is used to generate video. Each demo displays the input segmentation and the synthesized video side by side, from left to right.




Video (silent) → Description + Speech + Animation + Segmentation

We showcase results for video understanding, video dubbing, animation tracking, and video segmentation. From an input video, we parse the corresponding description, speech, and animation, while obtaining segmentation via an off-the-shelf model. The inferred description is displayed below each example. The visual composite presents the input video, extracted animation, and video segmentation arranged from left to right.




Any Modality Editing

Script Editing

We showcase script editing: we modify the script of the original video (left) to generate an edited video (right) that speaks the new script while faithfully preserving the original appearance and voice.




Description Editing

We present results for video editing via description. We modify the description of the original video to generate an edited video with a new appearance. When identity-defining attributes are altered (e.g., gender swap), we simultaneously adapt the voice to match the new identity (see second row). Notably, all unedited attributes and the original script are strictly preserved.




Animation Editing (Face Reenactment)

We present results for animation editing (face reenactment). We employ a reference video (left) to drive the motion of the original video. The resulting edited video (right) adopts the reference animation while retaining the original subject's appearance.




Comparisons

We compare speech-driven video generation against state-of-the-art methods. From left to right, the videos display: Ground Truth, AniPortrait, EchoMimic, Hallo3, and Ours.