Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

Yu, Youngjae; Chung, Jiwan; Yun, Heeseung; Hessel, Jack; Park, Jae Sung; Lu, Ximing; Zellers, Rowan; Ammanabrolu, Prithviraj; Le Bras, Ronan; Kim, Gunhee; Choi, Yejin

Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, Jae Sung Park, Ximing Lu, Rowan Zellers, Prithviraj Ammanabrolu, Ronan Le Bras, Gunhee Kim, Yejin Choi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10845-10856

Abstract

Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g. commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning), which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating text in several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper.
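
To make the reward signal mentioned in the abstract concrete, the following is a minimal sketch of a cosine-similarity reward between one image embedding and several candidate text embeddings. It assumes the embeddings have already been produced by a CLIP-style encoder and uses toy vectors in their place; the function names `cosine_similarity` and `similarity_rewards` are hypothetical and are not taken from the ESPER codebase, and the paper's actual reward may involve further normalization or shaping not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_rewards(image_embedding, text_embeddings):
    """One scalar reward per candidate text: its similarity to the image embedding."""
    return [cosine_similarity(image_embedding, t) for t in text_embeddings]


# Toy stand-ins for encoder outputs (assumption: real embeddings would come
# from a CLIP-style image encoder and text encoder).
image_embedding = [1.0, 0.0, 0.0]
candidate_text_embeddings = [
    [2.0, 0.0, 0.0],  # same direction as the image embedding
    [0.0, 3.0, 0.0],  # orthogonal to the image embedding
]

rewards = similarity_rewards(image_embedding, candidate_text_embeddings)
print(rewards)
```

In a reinforcement learning loop, scalar rewards of this kind would score each sampled generation against the input image, so no paired (image, text) annotation is needed to compute them.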

Related Material

@InProceedings{Yu_2023_CVPR,
    author    = {Yu, Youngjae and Chung, Jiwan and Yun, Heeseung and Hessel, Jack and Park, Jae Sung and Lu, Ximing and Zellers, Rowan and Ammanabrolu, Prithviraj and Le Bras, Ronan and Kim, Gunhee and Choi, Yejin},
    title     = {Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10845-10856}
}