InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report Generation

Ankan Deria, Komal Kumar, Snehashis Chakraborty, Dwarikanath Mahapatra, Sudipta Roy; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2028-2038

Abstract


Medical image captioning plays an important role in modern healthcare, improving clinical report generation and helping radiologists detect abnormalities and reduce misdiagnosis. Complex biases in the visual and textual data make this task more challenging. Recent advancements in transformer-based models have significantly improved the generation of radiology reports from medical images. However, these models require substantial computational resources for training and have been observed to produce unnatural language outputs when trained solely on raw image-text pairs. Our aim is to generate more detailed, image-specific reports and to explain the reasoning behind the generated text through image-text alignment. Given the high computational demands of end-to-end model training, we introduce a two-step training methodology with an Intelligent Visual Encoder for Bridging Modalities in Report Generation (InVERGe) model. This model incorporates a lightweight transformer, the Cross-Modal Query Fusion Layer (CMQFL), which uses the output of a frozen encoder to identify the most relevant text-grounded image embedding. This layer bridges the gap between the encoder and the decoder, significantly reducing the workload on the decoder and enhancing the alignment between vision and language. Experimental results on the MIMIC-CXR, Indiana University chest X-ray, and CDD-CESM breast image datasets demonstrate the effectiveness of our approach. Code: https://github.com/labsroy007/InVERGe
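The abstract does not spell out the internals of the CMQFL. As a rough illustration of the general idea of a small set of queries attending over the output of a frozen encoder, the following minimal sketch uses single-head cross-attention in NumPy. It is not the authors' implementation: the function `query_cross_attention`, all dimensions, the random stand-in for the encoder output, and the omission of the text-grounding step are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention.

    queries:     (num_queries, d)       query vectors (learnable in a real model)
    image_feats: (num_patches, enc_dim) output of a frozen image encoder
    Returns the fused query embeddings and the attention weights.
    """
    q = queries @ w_q                         # (num_queries, d)
    k = image_feats @ w_k                     # (num_patches, d)
    v = image_feats @ w_v                     # (num_patches, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ v, weights

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
num_patches, enc_dim, num_queries, d = 196, 64, 8, 32

image_feats = rng.normal(size=(num_patches, enc_dim))  # stand-in for frozen encoder output
queries = rng.normal(size=(num_queries, d))
w_q = rng.normal(size=(d, d)) * 0.1
w_k = rng.normal(size=(enc_dim, d)) * 0.1
w_v = rng.normal(size=(enc_dim, d)) * 0.1

fused, weights = query_cross_attention(queries, image_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch the decoder would receive only the `num_queries` fused vectors rather than every patch embedding, which is one way a bridging layer could reduce the decoder's workload.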

Related Material


[bibtex]
@InProceedings{Deria_2024_CVPR,
    author    = {Deria, Ankan and Kumar, Komal and Chakraborty, Snehashis and Mahapatra, Dwarikanath and Roy, Sudipta},
    title     = {InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2028-2038}
}