UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Anonymous author

Supplementary Material

UnityVideo Pipeline

Figure: Overview of the UnityVideo Framework

📹 Teaser Videos

JointGen

Sample 1 - View 1
Sample 1 - View 2
Sample 2 - View 1
Sample 2 - View 2
Sample 3 - View 1
Sample 3 - View 2
Sample 4 - View 1
Sample 4 - View 2

Estimator

Estimation 1
Estimation 2
Estimation 3
Estimation 4
Estimation 5
Estimation 6

ControGen

Control 1-1
Control 3-2
Control 1-2
Control 4-1
Control 3-1
Control 4-2
Control 2-1
Control 2-2

✨ Method Showcases

JointGen - Text to Video

T2A 1 - RGB
T2A 1 - Skeleton
T2A 2 - RGB
T2A 2 - Segmentation
T2A 3 - RGB
T2A 3 - Segmentation
T2A 4 - RGB
T2A 4 - RAFT

Estimator - Video to Modality

V2F 1 - RGB
V2F 1 - Skeleton
V2F 2 - RGB
V2F 2 - Skeleton
V2F 4 - RGB
V2F 4 - RAFT
V2F 3 - Depth
V2F 3 - DensePose

ControGen - Modality to Video

F2V 1 - Depth
F2V 1 - RGB
F2V 2 - RAFT
F2V 2 - RGB

🔍 Baseline Comparisons

Case 0 - Wan
Case 0 - UnityVideo
Case 1 - Hunyuan
Case 1 - UnityVideo
Case 2 - Hunyuan
Case 2 - UnityVideo
Case 3 - Hunyuan
Case 3 - UnityVideo
Case 4 - Hunyuan
Case 4 - UnityVideo
Case 5 - Hunyuan
Case 5 - UnityVideo
Case 6 - Hunyuan
Case 6 - UnityVideo
Case 7 - Hunyuan
Case 7 - UnityVideo
Case 8 - UnityVideo
Case 8 - VACE
Case 9 - UnityVideo
Case 9 - VACE