RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications

Sijia Chen, Bin Song; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 1686-1696

Abstract


Visual Language Models (VLMs) have achieved remarkable success in many domains due to their ability to perform step-by-step reasoning. However, progress in the telecommunication (Telecom) domain remains limited, primarily due to the lack of high-quality datasets and domain-specific insights. In this paper, we introduce RMultiplex200K, a multimodal dataset designed to provide step-wise reasoning rationales and correctness scores for real-world Telecom questions. This enables VLMs to engage in step-level reasoning and verification using multimodal information, thereby facilitating reliable problem-solving. RMultiplex200K is highly scalable as it is constructed without human annotations, relying instead on our automatic plan-based annotation (ApPA) method, which automatically synthesizes reasoning steps labeled with reward scores. With this dataset, we introduce TC-NAVIGATOR, a new mechanism for training multimodal process reward models to serve as reliable reasoning verifiers for VLMs. For instance, Qwen-2-VL-72B and Llama-3.2-90B, which initially achieve only 21.3% and 19.8% accuracy on practical Telecom questions, reach 48.5% and 46.1%, respectively, after training with RMultiplex200K and verification with TC-NAVIGATOR.
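The abstract gives no implementation detail, but the verification idea it describes, a process reward model scoring each reasoning step so that a VLM's most reliable chain can be selected, can be sketched briefly. In the Python sketch below, every name (ReasoningChain, score_step, verify_best_of_n) is a hypothetical illustration assuming best-of-N selection with min-aggregated step rewards; it is not the authors' actual TC-NAVIGATOR interface.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ReasoningChain:
    question: str
    steps: List[str]  # step-wise rationales, as RMultiplex200K stores per question
    answer: str

def verify_best_of_n(
    chains: List[ReasoningChain],
    score_step: Callable[[str, List[str], str], float],
) -> ReasoningChain:
    """Return the candidate chain whose weakest step receives the
    highest reward from the process reward model (min-aggregation)."""
    def chain_score(chain: ReasoningChain) -> float:
        # Score each step given the question and the steps preceding it.
        rewards = [
            score_step(chain.question, chain.steps[:i], step)
            for i, step in enumerate(chain.steps)
        ]
        return min(rewards) if rewards else float("-inf")
    return max(chains, key=chain_score)

Min-aggregation is one common choice for process reward models, since a single flawed step can invalidate an otherwise plausible chain; mean-aggregation over steps is a frequent alternative.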

Related Material


@InProceedings{Chen_2025_ICCV,
    author    = {Chen, Sijia and Song, Bin},
    title     = {RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {1686-1696}
}