Plug-and-Think: Structured Reasoning for Vision-Language-Action Models

Kaikai Wei, Di Wen, Xinhai Li, Senwei Xiang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 3136-3145

Abstract


Vision-Language-Action (VLA) systems often fail on out-of-distribution tasks and lack interpretable reasoning, requiring costly retraining to adapt. We address these limitations by proposing BridgeLang, a lightweight external reasoning supplementor that enhances unmodified, pre-trained VLA models. BridgeLang is an efficient visual language model trained on our new Bridge-CoT dataset using a prompt-based instruction-finetuning strategy. This strategy preserves the model's general abilities while teaching it to act as a scene-aware planner. Given an initial observation and task, BridgeLang performs hierarchical reasoning: it internally identifies <objects> and <relations> as a scaffold, then generates a high-quality, executable <subgoals> plan. This "think-before-act" process occurs once, after which only the semantically cleaned subgoals string is concatenated with the original instruction. When integrated with OpenVLA, BridgeLang improves average success rates on the LIBERO benchmark by 5.45% (up to 8.2%) without any VLA retraining and at the cost of only a small, one-time pre-execution latency. Our work demonstrates the efficacy of decoupled, scaffolded reasoning and introduces the Bridge-CoT dataset to facilitate structured multimodal planning. The dataset and code will be released upon acceptance.
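The hand-off described above (reason once, keep only the cleaned subgoals, append them to the original instruction) can be illustrated with a minimal sketch. This is not the authors' implementation: the closing-tag format, the cleaning rules, the "Subgoals:" separator, and the function names `extract_subgoals` and `augment_instruction` are all assumptions made for illustration, and the example planner output is invented.

```python
import re


def extract_subgoals(reasoning: str) -> str:
    """Return the text inside <subgoals>...</subgoals>, cleaned to one line.

    Assumption: the planner wraps each stage in XML-like open/close tags.
    """
    match = re.search(
        r"<subgoals>(.*?)</subgoals>", reasoning, flags=re.DOTALL | re.IGNORECASE
    )
    if match is None:
        return ""
    # Drop any stray tags, then collapse newlines and repeated spaces.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())


def augment_instruction(instruction: str, reasoning: str) -> str:
    """Concatenate the original instruction with the cleaned subgoals string.

    Falls back to the unmodified instruction when no subgoals are found.
    The "Subgoals:" separator is a hypothetical choice.
    """
    subgoals = extract_subgoals(reasoning)
    if not subgoals:
        return instruction
    return f"{instruction.strip()} Subgoals: {subgoals}"


# Invented planner output, used only to exercise the functions above.
example_output = """
<objects>black bowl, plate, drawer</objects>
<relations>black bowl is next to the plate</relations>
<subgoals>
1. reach the black bowl
2. grasp the black bowl
3. place it on the plate
</subgoals>
"""

prompt = augment_instruction("put the black bowl on the plate", example_output)
print(prompt)
```

In this sketch the <objects> and <relations> content never reaches the downstream policy; only the augmented instruction string would be passed to the unmodified VLA model, which matches the one-time, pre-execution role the abstract describes.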

Related Material


[bibtex]
@InProceedings{Wei_2026_CVPR,
    author    = {Wei, Kaikai and Wen, Di and Li, Xinhai and Xiang, Senwei},
    title     = {Plug-and-Think: Structured Reasoning for Vision-Language-Action Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {3136-3145}
}