@InProceedings{Yuan_2026_CVPR,
    author    = {Yuan, Kun and Sun, Min and Chen, Zhen and Lozano, Alejandro and He, Xiangteng and Li, Shi and Navab, Nassir and Sun, Xiaoxiao and Padoy, Nicolas and Yeung-Levy, Serena},
    title     = {From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {42649-42658}
}
From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
Abstract
There is growing interest in biomedical vision-language models trained on scientific literature. However, most pipelines compress rich multi-panel figures and long captions into coarse figure-level pairs, discarding the fine-grained correspondences clinicians rely on when zooming into local structures. We introduce Panel2Patch, a data pipeline that mines hierarchical structure from multi-panel, marker-heavy biomedical figures and their surrounding text, and converts them into multi-granular supervision. Given figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs aligned image-text pairs at the figure, panel, and region levels, preserving local semantics instead of treating each figure as a single sample. Built on this corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases in a shared embedding space. Applying Panel2Patch to a small subset of literature figures yields substantially better performance than prior pipelines, demonstrating that exploiting hierarchical figure structure can provide more effective supervision with less pretraining data.
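To make the three supervision levels concrete, the following is a minimal Python sketch of how one parsed figure might be flattened into figure-, panel-, and region-level image-text pairs. It is an illustration under stated assumptions, not the authors' implementation: the class names (`Figure`, `Panel`, `Region`, `Pair`), the function `flatten_figure`, the box convention, and the example captions are all hypothetical, and the layout, panel, and marker parsing is assumed to have happened upstream.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in figure pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    box: Box       # area indicated by a visual marker (arrow, circle, ...)
    phrase: str    # short region-focused text


@dataclass
class Panel:
    label: str     # panel label such as "A" or "B"
    box: Box
    caption: str   # the part of the caption describing this panel
    regions: List[Region] = field(default_factory=list)


@dataclass
class Figure:
    figure_id: str
    caption: str   # full figure-level caption
    panels: List[Panel] = field(default_factory=list)


@dataclass
class Pair:
    level: str            # "figure", "panel", or "region"
    figure_id: str
    box: Optional[Box]    # None means the whole figure
    text: str


def flatten_figure(fig: Figure) -> List[Pair]:
    """Emit one pair per node of the figure -> panel -> region hierarchy."""
    pairs = [Pair("figure", fig.figure_id, None, fig.caption)]
    for panel in fig.panels:
        pairs.append(Pair("panel", fig.figure_id, panel.box, panel.caption))
        for region in panel.regions:
            pairs.append(Pair("region", fig.figure_id, region.box, region.phrase))
    return pairs


# Made-up example: two panels, one of which contains a single marked region.
example = Figure(
    figure_id="fig1",
    caption="(A) Axial CT image. (B) Histology section.",
    panels=[
        Panel("A", (0, 0, 100, 100), "Axial CT image.",
              regions=[Region((40, 40, 60, 60), "lesion marked by arrow")]),
        Panel("B", (100, 0, 200, 100), "Histology section."),
    ],
)
pairs = flatten_figure(example)
for p in pairs:
    print(p.level, p.box, p.text)
```

Keeping the `level` tag on each pair is one simple way a downstream training loop could treat coarse and fine pairs differently; how the granularity-aware objective actually uses the levels is not specified in the abstract.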