TRINS: Towards Multimodal Language Models that Can Read

Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22584-22594

Abstract


Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to the limitation of training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly longer than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-Vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset as well as other classical benchmarks. Lastly, we conducted a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Ruiyi and Zhang, Yanzhe and Chen, Jian and Zhou, Yufan and Gu, Jiuxiang and Chen, Changyou and Sun, Tong},
    title     = {TRINS: Towards Multimodal Language Models that Can Read},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {22584-22594}
}