MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Shaar, Shaden; Thymes, Bradon; Chaixanien, Sirawut; Cardie, Claire; Hariharan, Bharath

Shaden Shaar, Bradon Thymes, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 4537-4546

Abstract

Understanding real-world videos such as movies requires integrating visual and dialogue cues. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and, given the difficulty of evaluating free-form answers, largely resort to simple multiple choice questions. We introduce a novel open-ended multimodal VideoQA benchmark, MovieRecapsQA, created using movie recap videos -- a distinctive type of YouTube content that summarizes a film via a voiceover description of key clips from the movie (recap video). From the transcribed voiceover (recap summary) of 60 recap videos, we generate 8.2K questions along with the necessary "facts" expected in each answer; the former facilitates the creation of questions that require mutimodal reasoning and the latter allow the construction of a reference-free evaluation metric that can be applied to open-ended responses. To our knowledge, this is the first reference-free open-ended VideoQA benchmark. The benchmark allows each question to be evaluated in different input video settings: given (a) the full-length movie, (b) the full ( 11 min) recap video (visual only), (c) 14 min of aligned movie scenes, i.e, movie scenes relevant to the question, and (d) 1.2 min of aligned recap video scenes. In all cases, the text of any associated movie dialogue is provided. Each question is categorized by the modality required to answer it---visual, dialogue, or both---enabling fine-grained evaluation of multimodal capabilities. We benchmark (setting (d)) seven state-of-the-art MLLMs and find that (i) only our reference-free metric produces meaningful human-aligned model separation; (ii) vision-centric questions yield the lowest scores across all models; (iii) removing visual input often improves model factuality; and (iv) the primary bottleneck is visual perception, not visual reasoning.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Shaar_2026_CVPR, author = {Shaar, Shaden and Thymes, Bradon and Chaixanien, Sirawut and Cardie, Claire and Hariharan, Bharath}, title = {MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2026}, pages = {4537-4546} }