@InProceedings{Zohra_2025_CVPR,
  author    = {Zohra, Fatimah and Zhao, Chen and Liu, Shuming and Ghanem, Bernard},
  title     = {Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {3291-3300}
}
Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos
Abstract
CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. However, its extension to videos is challenged by the need for additional temporal modeling. While recent works have attempted to bridge this modality gap through the integration of complex modules, we apply a simple and modular approach to enhance CLIP's video understanding on action recognition tasks. In its standard application, CLIP processes each video frame independently, restricting its ability to associate features across frames. To address this, we apply frame-wise max-pooling on the tokens within the transformer layers to construct a new set of tokens that help the model extract temporal information. We then use max-pooling to aggregate the frame features into a single video feature. We evaluate the effectiveness of this approach on different action recognition benchmarks, showing that max-pooling helps fine-tune the model to extract features that are important for temporal modeling. Furthermore, we show that the max-pooling of tokens is particularly useful when applied to the last few layers of the model, which are typically more specialized toward capturing abstract, high-level image features. To the best of our knowledge, we achieve state-of-the-art results on the base-to-novel and few-shot benchmarks on the Something-SomethingV2 dataset.
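The abstract describes two max-pooling operations: pooling tokens across frames inside a transformer layer to build a new set of tokens, and pooling per-frame features into a single video feature. Below is a minimal PyTorch sketch of these two steps; the tensor shapes, function names, and the way the pooled tokens would be fed back into the model are assumptions for illustration only and do not reproduce the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two max-pooling steps
# described in the abstract. Shapes and integration details are assumed.
import torch


def framewise_token_maxpool(tokens: torch.Tensor) -> torch.Tensor:
    """Max-pool each token position across frames.

    tokens: (B, T, N, D) -- batch, frames, tokens per frame, channel dim.
    Returns a new token set of shape (B, N, D) summarizing the clip.
    """
    return tokens.max(dim=1).values


def aggregate_video_feature(frame_features: torch.Tensor) -> torch.Tensor:
    """Max-pool per-frame CLIP features into a single video feature.

    frame_features: (B, T, D) -- one feature vector per frame.
    Returns: (B, D).
    """
    return frame_features.max(dim=1).values


if __name__ == "__main__":
    B, T, N, D = 2, 8, 197, 768            # ViT-B/16-like token layout (assumed)
    layer_tokens = torch.randn(B, T, N, D)

    # 1) New tokens built by max-pooling across frames inside a transformer
    #    layer; the paper applies this only to the last few layers.
    pooled_tokens = framewise_token_maxpool(layer_tokens)   # (B, N, D)

    # 2) Final aggregation of per-frame features into one video feature.
    frame_feats = torch.randn(B, T, 512)
    video_feat = aggregate_video_feature(frame_feats)        # (B, 512)
    print(pooled_tokens.shape, video_feat.shape)
```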