SEED-Bench: Benchmarking Multimodal Large Language Models

Li, Bohao; Ge, Yuying; Ge, Yixiao; Wang, Guangzhi; Wang, Rui; Zhang, Ruimao; Shan, Ying

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13299-13308

Abstract

Multimodal large language models (MLLMs) building upon the foundation of powerful large language models (LLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work we categorize the capabilities of MLLMs into hierarchical levels from L_0 to L_4 based on the modalities they can accept and generate and propose SEED-Bench a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. Specifically SEED-Bench comprises 24K multiple-choice questions with accurate human annotations which spans 27 dimensions including the evaluation of both text and image generation. Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 22 prominent open-source MLLMs and summarize valuable observations. By revealing the limitations of existing MLLMs through extensive evaluations we aim for SEED-Bench to provide insights that will motivate future research towards the goal of General Artificial Intelligence.

Related Material

[pdf] [supp]

[bibtex]

@InProceedings{Li_2024_CVPR, author = {Li, Bohao and Ge, Yuying and Ge, Yixiao and Wang, Guangzhi and Wang, Rui and Zhang, Ruimao and Shan, Ying}, title = {SEED-Bench: Benchmarking Multimodal Large Language Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {13299-13308} }