-
[pdf]
[supp]
[bibtex]@InProceedings{Zhao_2025_CVPR, author = {Zhao, Zhaoran and Lu, Peng and Zhang, Anran and Li, Peipei and Li, Xia and Liu, Xuannan and Hu, Yang and Chen, Shiyi and Wang, Liwei and Guo, Wenhao}, title = {Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding}, booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)}, month = {June}, year = {2025}, pages = {14411-14421} }
Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding
Abstract
With the rapid growth of social media and digital photography, visually appealing images have become essential for effective communication and emotional engagement. Among the factors influencing aesthetic appeal, composition--the arrangement of visual elements within a frame--plays a crucial role. In recent years, specialized models for photographic composition have achieved impressive results across various aesthetic tasks. Meanwhile, rapidly advancing multimodal large language models (MLLMs) have excelled in several visual perception tasks. However, their ability to embed and understand compositional information remains underexplored, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce the Photographic Image Composition Dataset (PICD), a large-scale dataset consisting of 36,857 images categorized into 24 composition categories across 355 diverse scenes. We demonstrate the advantages of PICD over existing datasets in terms of data scale, composition category, label quality, and scene diversity. Building on PICD, we establish benchmarks to evaluate the composition embedding capabilities of specialized models and the compositional understanding ability of MLLMs. To enable efficient and effective evaluation, we propose a novel Composition Discrimination Accuracy (CDA) metric. Our evaluation highlights the limitations of current models and provides insights into directions for improving their ability to embed and understand composition.
Related Material