SST-EM: Advanced Metrics for Evaluating Semantic Spatial and Temporal Aspects in Video Editing

Varun Biyyala, Bharat Chanderprakash Kathuria, Jialu Li, Youshan Zhang; Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, 2025, pp. 259-268

Abstract


Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), object detection, and temporal consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM; (2) primary object tracking with object detection; (3) focused object refinement via an LLM agent; and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on the Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the GitHub Repository.
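The abstract describes four component scores combined into a single metric with weights fitted to human judgments. A minimal sketch of such a weighted combination is shown below; the function name, component names, and weight values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sst_em_score(semantic, spatial, refinement, temporal, weights):
    """Combine four component scores (each assumed in [0, 1]) into one scalar.

    `weights` stand in for coefficients obtained by regressing component
    scores against human evaluations, as the paper describes; the actual
    fitted values are not given here.
    """
    components = np.array([semantic, spatial, refinement, temporal], dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the combined score stays in [0, 1]
    return float(w @ components)

# Example with made-up scores and weights
score = sst_em_score(0.82, 0.75, 0.70, 0.90, weights=[0.3, 0.2, 0.2, 0.3])
```

In practice the weights would be estimated once on a held-out set of human-rated edited videos and then reused for all subsequent evaluations.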

Related Material


[bibtex]
@InProceedings{Biyyala_2025_WACV,
    author    = {Biyyala, Varun and Kathuria, Bharat Chanderprakash and Li, Jialu and Zhang, Youshan},
    title     = {SST-EM: Advanced Metrics for Evaluating Semantic Spatial and Temporal Aspects in Video Editing},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {259-268}
}