Mitigating Vision-Text Order Bias in Vision-Language Model

Weilin Gan, Yifan Song, Zhuocheng Yu, Sujian Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 9664-9673

Abstract


Vision-Language Models (VLMs) suffer from a significant vision-text order bias: suffix order (visual tokens at the end) substantially underperforms prefix order (visual tokens at the beginning). To explore how the position of visual content affects VLMs, we systematically investigate vision-text position bias. Our preliminary experiments reveal that placing visual tokens at the beginning or the end of a sequence yields better performance than placing them in the middle, resulting in a U-shaped performance curve. To address the performance gap caused by non-prefixed inputs in real-world scenarios, we propose Dual-Order Contrastive Decoding (DOCD), a training-free, lightweight inference scheme designed to improve non-prefix understanding in VLMs. DOCD runs inference on the prefix and suffix orders in parallel and contrastively compensates the suffix logits with the prefix logits, drawing on the visual comprehension of the prefix order while staying closely tied to the visual content of the suffix order. Experimental results show that suffix inputs with DOCD match or even outperform the prefix order on a wide range of difficult benchmarks, including Muirbench, Vlmsareblind, and MMMU-Pro.
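
The abstract describes DOCD only at a high level and does not state the rule used to combine the two sets of logits. The following is a minimal sketch of the general idea of a two-pass decoding step, under stated assumptions: the function names, the linear combination rule, and the weight `alpha` are all hypothetical placeholders for illustration and are not taken from the paper.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def combine_dual_order(
    suffix_logits: List[float],
    prefix_logits: List[float],
    alpha: float = 0.5,
) -> List[float]:
    """Hypothetical combination rule (an assumption, not the paper's formula).

    Starts from the suffix-order log-probabilities and adds alpha times the
    difference between the prefix-order and suffix-order log-probabilities.
    alpha = 0 returns the suffix scores unchanged; alpha = 1 returns the
    prefix scores.
    """
    s = log_softmax(suffix_logits)
    p = log_softmax(prefix_logits)
    return [si + alpha * (pi - si) for si, pi in zip(s, p)]


def greedy_token(scores: List[float]) -> int:
    """Index of the highest-scoring token (greedy decoding for one step)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy next-token logits over a 3-token vocabulary. In practice these would
# come from two forward passes of the same VLM on the same question, once
# with the visual tokens after the text and once with them before it.
suffix_logits = [2.0, 1.0, 0.0]
prefix_logits = [0.0, 3.0, 0.0]

next_token = greedy_token(combine_dual_order(suffix_logits, prefix_logits, alpha=0.5))
```

In this toy case the suffix pass alone would pick token 0, while the combined scores pick token 1, showing how the prefix pass can shift a suffix-order prediction. How DOCD actually weights the two passes, and how it keeps the output tied to the suffix-order visual content, is specified in the paper rather than here.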

Related Material


[bibtex]
@InProceedings{Gan_2026_CVPR,
    author    = {Gan, Weilin and Song, Yifan and Yu, Zhuocheng and Li, Sujian},
    title     = {Mitigating Vision-Text Order Bias in Vision-Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {9664-9673}
}