TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Anonymous CVPR submission

Our main contributions:

1. We propose TUNA, a native unified multimodal model with unified visual representations, enabling image/video understanding, image/video generation, and image editing within a single framework.
2. Extensive experiments show that TUNA's unified visual representation is highly effective: the model achieves state-of-the-art performance across multiple multimodal understanding and generation tasks.
3. We further conduct a comprehensive ablation study, which shows that our unified visual representation design outperforms existing methods that use decoupled representations.

Text-to-Video Generation Results
