@InProceedings{Xiong_2026_CVPR,
    author    = {Xiong, Honglin and Zhu, Chenjie and Ding, Jianbiao and Ni, Zixuan and Li, Wei and Mi, Zhenpeng and Wang, Qian},
    title     = {Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {41521-41530}
}
Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion
Abstract
Real-world image restoration is challenged by complex, coupled degradations. Existing "all-in-one" models often lack generalization, while agentic systems suffer from inefficient sequential tool invocation. We propose a VLM-guided one-shot framework for universal photographic post-processing. Our system employs a Vision-Language Model (VLM) as an orchestrator to perform nuanced intent understanding and degradation analysis, dynamically allocating weights to a suite of specialized expert LoRA modules. To ensure superior composability, these experts adapt only Key (K) and Value (V) matrices and are simultaneously merged into a pretrained diffusion backbone for synergistic, single-pass restoration. Furthermore, we introduce a lightweight branch trained via Direct Preference Optimization (DPO) to ensure perceptually optimal weight allocation. Our method achieves state-of-the-art performance across diverse synthetic and real-world datasets. Crucially, it demonstrates remarkable zero-shot generalization on authentic real-world data without additional fine-tuning.
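The weighted fusion of expert LoRA modules described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact merge rule, so the linear form W' = W + sum_i w_i * B_i * A_i used here is an assumption, as are the expert names (`denoise`, `deblur`), the toy 2x2 shapes, and the helper names `matmul` and `merge_experts`. In the system described, such a merge would apply only to the Key and Value projection matrices, with the weights supplied by the VLM orchestrator.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, kept dependency-free for illustration."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_experts(
    base: Matrix,
    experts: Dict[str, Tuple[Matrix, Matrix]],
    weights: Dict[str, float],
) -> Matrix:
    """Return base + sum_i w_i * (B_i @ A_i) for one projection matrix.

    Assumed merge rule: a linear, weighted sum of low-rank updates.
    `experts` maps a hypothetical expert name to its LoRA factors (B, A).
    """
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for name, (b_factor, a_factor) in experts.items():
        w = weights.get(name, 0.0)
        if w == 0.0:
            continue  # expert not selected for this image
        delta = matmul(b_factor, a_factor)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


# Toy example: a 2x2 "Key" projection and two rank-1 experts.
base_k = [[1.0, 0.0], [0.0, 1.0]]
experts = {
    "denoise": ([[1.0], [0.0]], [[0.0, 1.0]]),  # hypothetical expert
    "deblur": ([[0.0], [1.0]], [[1.0, 0.0]]),   # hypothetical expert
}
# Weights a VLM orchestrator might assign after analysing the degradation.
weights = {"denoise": 0.5, "deblur": 0.25}

merged = merge_experts(base_k, experts, weights)
```

Because all experts are folded into the backbone weights before inference, the restoration itself runs in a single pass, in contrast to invoking one tool after another.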

