Rethinking Training Data for Mitigating Representation Biases in Action Recognition
The purpose of this study is to train spatiotemporal 3D convolutional neural networks (3D CNNs) that properly leverage temporal information to recognize actions. Although 3D CNNs are an effective framework for action recognition, prior studies have shown that biases in video datasets for generic action recognition lead 3D CNNs to rely not on dynamic motion but on static cues such as objects, scenes, and people. In contrast, video datasets for fine-grained action recognition, which classify diverse actions within a specific domain, are expected to carry smaller biases than datasets for generic action recognition. In this study, we examine the biases of various video datasets, covering both generic and fine-grained action recognition tasks, for training 3D CNNs. Our experiments support the following conclusions: (i) the representation biases learned from fine-grained action recognition datasets are smaller than those learned from generic action recognition datasets; (ii) models pretrained on fine-grained action recognition datasets, whose biases are small, leverage temporal information rather than static information to recognize actions; and (iii) models that leverage temporal information achieve better performance on fine-grained action recognition, whereas models pretrained on biased datasets perform better on generic action recognition. Models should therefore be evaluated on both generic and fine-grained recognition datasets to properly assess their performance.