Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models

Feipeng Ma, Yizhou Zhou, Yueyi Zhang, Siying Wu, Zheyu Zhang, Zilong He, Fengyun Rao, Xiaoyan Sun; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2248-2257

Abstract


Inspired by the remarkable progress achieved by recent Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) take LLMs as their brains and have achieved surprising results in many downstream tasks by training on a large amount of task-specific data. However, when faced with complex tasks that require the collaboration of multiple capabilities, existing MLLMs re-collect training data and retrain the model, ignoring the systematic utilization of LLMs and the capabilities they have already learned in downstream tasks. Inspired by the way humans tackle complex questions, in this paper we propose a novel framework called Task Navigator. In our framework, LLMs act as navigators to chart a viable path for solving complex tasks and guide MLLMs through the process step by step. Specifically, LLMs iteratively break down sub-problems and refine them to be more reasonable and answerable; these are subsequently resolved by MLLMs to obtain relevant sub-answers, until the LLMs have collected enough information to answer the initial question. Task Navigator provides an effective way to extend MLLMs to tackle complex tasks, thus broadening MLLMs' applicability. To evaluate the performance of the proposed framework, we have curated a carefully designed benchmark called VersaChallenge. Experiments on VersaChallenge demonstrate the effectiveness of our proposed method.
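The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `navigate`, its four callable parameters, the `max_steps` cap, and the toy stand-ins below are all hypothetical names introduced here, with plain Python functions in place of a real LLM and MLLM. It only shows the control flow: propose a sub-question, refine it, have the multimodal model answer it, and stop once the navigator signals it has enough information.

```python
from typing import Callable, List, Optional, Tuple

# (sub-question, sub-answer) pairs collected so far.
History = List[Tuple[str, str]]


def navigate(
    question: str,
    propose: Callable[[str, History], Optional[str]],  # assumed LLM role: next sub-question, or None when done
    refine: Callable[[str, History], str],             # assumed LLM role: make the sub-question answerable
    answer_sub: Callable[[str], str],                  # assumed MLLM role: answer one sub-question
    finalize: Callable[[str, History], str],           # assumed LLM role: answer the initial question
    max_steps: int = 5,                                # hypothetical safety cap, not from the paper
) -> Tuple[str, History]:
    history: History = []
    for _ in range(max_steps):
        sub_q = propose(question, history)
        if sub_q is None:  # navigator judges that enough information was collected
            break
        sub_q = refine(sub_q, history)
        history.append((sub_q, answer_sub(sub_q)))
    return finalize(question, history), history


# Toy stand-ins so the sketch runs without any model.
def toy_propose(question: str, history: History) -> Optional[str]:
    plan = ["what object is on the table", "what color is it"]
    return plan[len(history)] if len(history) < len(plan) else None


def toy_refine(sub_q: str, history: History) -> str:
    return sub_q.strip().capitalize() + "?"


def toy_answer_sub(sub_q: str) -> str:
    canned = {
        "What object is on the table?": "a mug",
        "What color is it?": "red",
    }
    return canned.get(sub_q, "unknown")


def toy_finalize(question: str, history: History) -> str:
    return "; ".join(answer for _, answer in history)


if __name__ == "__main__":
    final, steps = navigate(
        "What color is the object on the table?",
        toy_propose, toy_refine, toy_answer_sub, toy_finalize,
    )
    print(final)
    print(steps)
```

Passing the four roles as callables keeps the loop independent of any particular model API; in practice each would wrap a prompt to an LLM or MLLM, which the abstract does not specify.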

Related Material


[bibtex]
@InProceedings{Ma_2024_CVPR,
    author    = {Ma, Feipeng and Zhou, Yizhou and Zhang, Yueyi and Wu, Siying and Zhang, Zheyu and He, Zilong and Rao, Fengyun and Sun, Xiaoyan},
    title     = {Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2248-2257}
}