- [pdf] [supp] [arXiv]
OCSampler: Compressing Videos to One Clip With Single-Step Sampling
Videos incorporate rich semantics as well as redundant information. Seeking a compact yet effective video representation, e.g., sample informative frames from the entire video, is critical to efficient video recognition. There have been works that formulate frame sampling as a sequential decision task by selecting frames one by one according to their importance. In this paper, we present a more efficient framework named OCSampler, which explores such a representation with one short clip. OCSampler designs a new paradigm of learning instance-specific video condensation policies to select frames only in a single step. Rather than picking up frames sequentially like previous methods, we simply process a whole sequence at once. Accordingly, these policies are derived from a light-weighted skim network together with a simple yet effective policy network. Moreover, we extend the proposed method with a frame number budget, enabling the framework to produce correct predictions in high confidence with as few frames as possible. Experiments on various benchmarks demonstrate the effectiveness of OCSampler over previous methods in terms of accuracy and efficiency. Specifically, it achieves 76.9% mAP and 21.7 GFLOPs on ActivityNet with an impressive throughput: 123.9 Video/s on a single TITAN Xp GPU.