@InProceedings{Gao_2026_CVPR,
    author    = {Gao, Hongcheng and Tang, Jingyi and Huang, Zihao and Li, Liang and Su, Li and Huang, Qingming},
    title     = {TreeReasoner: Reinforcing Tool-Augmented Tree-of-Videos Reasoning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {11457-11469}
}
TreeReasoner: Reinforcing Tool-Augmented Tree-of-Videos Reasoning
Abstract
We present TreeReasoner, a tool-augmented, tree-structured reasoning framework that recasts long-video understanding as an active hypothesis-verification problem over a vast visual search space. By maintaining multiple parallel reasoning paths, the model systematically explores the temporal dimension and, guided by intermediate hypotheses, invokes frame-level tools such as temporal zooming, temporal jumping, and sliding to incrementally search for a minimal yet sufficient chain of evidence. After a supervised fine-tuning warmup, the entire framework is trained end-to-end with Tree-of-Tool Relative Policy Optimization (ToT-RPO). It achieves higher video-understanding accuracy while decoding far fewer frames than existing methods, and it exhibits interpretable temporal-localization and causal-verification behaviors. Experiments on six long-video reasoning benchmarks show that TreeReasoner consistently outperforms both standard IO and naive tool-calling baselines. Transfer experiments on hallucination evaluation further confirm that it generalizes and is less prone to hallucination. Together, these experiments validate the stability and efficiency of TreeReasoner in complex temporal scenarios.

