ADAPTOR: Adaptive Token Reduction for Video Diffusion Transformers
Abstract
Transformers are becoming ubiquitous across various tasks due to their expressive power and scalability. Isotropic transformer models, in particular, offer structural simplicity, making large-scale training more feasible. However, their computational cost escalates dramatically with high-dimensional inputs, such as videos, due to the quadratic scaling of attention operations. This paper builds on transformer optimization research and applies it to video diffusion transformers (DiTs) for the first time. We introduce ADAPTOR, a lightweight token reduction technique that efficiently compresses video data by exploiting temporal redundancy. Designed for seamless integration into existing architectures, ADAPTOR significantly lowers computational cost while preserving performance. Evaluated on the Open-Sora Plan model and benchmarked with VBench, ADAPTOR reduces TFLOPS while slightly outperforming competitors in overall quality. Notably, it achieves superior results on the dynamic degree metric, capturing motion more effectively than other methods.
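The abstract does not spell out the reduction mechanism, but a minimal, hypothetical sketch can illustrate how temporal redundancy might drive token reduction in a video DiT: tokens that change little between adjacent frames are dropped before attention. The function name temporal_token_reduction, the cosine-similarity criterion, and the threshold parameter below are illustrative assumptions, not ADAPTOR's actual algorithm.

import torch
import torch.nn.functional as F

def temporal_token_reduction(tokens, threshold=0.9):
    """Hypothetical sketch (not the ADAPTOR algorithm): drop tokens that
    are nearly unchanged from the previous frame.

    tokens: (T, N, D) tensor of T frames, N tokens per frame, D channels.
    Returns a list of per-frame tensors; frames after the first keep only
    tokens whose cosine similarity to the previous frame falls below
    `threshold`, so redundant tokens never reach the attention layers.
    """
    kept = [tokens[0]]  # always keep the full first frame
    for t in range(1, tokens.shape[0]):
        # Per-token cosine similarity to the same token in the previous frame.
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)  # (N,)
        mask = sim < threshold  # True = token changed enough to keep
        kept.append(tokens[t][mask])
    return kept

# Toy example: 16 highly redundant frames (shared content plus small noise),
# 256 tokens per frame, 128 channels.
base = torch.randn(1, 256, 128)
video_tokens = base + 0.05 * torch.randn(16, 256, 128)
reduced = temporal_token_reduction(video_tokens)
print([f.shape[0] for f in reduced])  # tokens kept per frame

Because attention cost grows quadratically in token count, pruning redundant tokens this way reduces compute roughly quadratically in the retained fraction, which is the kind of saving the abstract's TFLOPS claim refers to.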
Related Material
[pdf] [bibtex]
@InProceedings{Peruzzo_2025_CVPR,
    author    = {Peruzzo, Elia and Karjauv, Adil and Sebe, Nicu and Ghodrati, Amir and Habibian, Amir},
    title     = {ADAPTOR: Adaptive Token Reduction for Video Diffusion Transformers},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {6365-6371}
}