Our main contributions:
1. We propose TUNA, a native unified multimodal model with unified visual representations, enabling
image/video understanding, image/video generation, and image editing within a single framework.
2. Extensive experiments show that TUNA's unified visual representation is highly effective: the model
achieves state-of-the-art performance across multiple multimodal understanding and generation tasks.
3. A comprehensive ablation study further demonstrates that our unified visual representation design
outperforms existing methods that employ decoupled representations.