Analysis of Deep Fusion Strategies for Multi-Modal Gesture Recognition

Alina Roitberg, Tim Pollert, Monica Haurilet, Manuel Martin, Rainer Stiefelhagen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019

Abstract


Video-based gesture recognition has a wide spectrum of applications, ranging from sign language understanding to driver monitoring in autonomous cars. Since each sensor suffers from its own limitations, combining multiple sources has strong potential to improve recognition results. A number of deep architectures have been proposed to recognize gestures from multiple modalities, e.g. color and depth data. However, these models conventionally comprise a separate network for each modality, with the streams combined only in the final layer (e.g. via simple score averaging). In this work, we take a closer look at different fusion strategies for gesture recognition, focusing especially on information exchange in the intermediate layers. We compare three fusion strategies on the widely used C3D architecture: 1) late fusion, combining the streams in the final layer; 2) information exchange in an intermediate layer using an additional convolution layer; and 3) linking information at multiple layers simultaneously using cross-stitch units, originally designed for multi-task learning. Our proposed C3D-Stitch model achieves the best recognition rate, demonstrating the effectiveness of sharing information at earlier stages.
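
The cross-stitch unit at the heart of strategy 3) can be sketched in a few lines. The following PyTorch snippet is an illustrative approximation, not the authors' code: the class name CrossStitchUnit, the identity-biased initialization, and the use of a single 2x2 mixing matrix per unit are our assumptions (the original cross-stitch formulation of Misra et al. learns such linear combinations per channel or per activation).

import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learnable 2x2 linear combination of two streams' activations."""
    def __init__(self):
        super().__init__()
        # Initialized close to identity, so each stream initially keeps
        # mostly its own features while cross-modal exchange can be learned.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x_a, x_b):
        # x_a, x_b: same-shaped activations from the two streams, e.g.
        # (batch, channels, frames, height, width) for 3D conv features.
        out_a = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
        out_b = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
        return out_a, out_b

# Hypothetical usage with C3D-sized feature maps for an RGB and a depth stream:
unit = CrossStitchUnit()
rgb = torch.randn(2, 64, 8, 28, 28)
depth = torch.randn(2, 64, 8, 28, 28)
rgb_mixed, depth_mixed = unit(rgb, depth)

In a C3D-Stitch-style model, one such unit would sit after each pooling stage of the two streams, so that information is exchanged at several depths rather than only at the final scores.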

Related Material


[bibtex]
@InProceedings{Roitberg_2019_CVPR_Workshops,
author = {Roitberg, Alina and Pollert, Tim and Haurilet, Monica and Martin, Manuel and Stiefelhagen, Rainer},
title = {Analysis of Deep Fusion Strategies for Multi-Modal Gesture Recognition},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2019}
}