A Hybrid Visual Transformer for Efficient Deep Human Activity Recognition

Youcef Djenouri, Ahmed Nabil Belbachir; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 721-730

Abstract


Human Activity Recognition (HAR) has gained significant attention in recent years due to its wide-ranging applications. This paper introduces a novel hybrid visual transformer methodology designed to enable robust analysis and understanding of human activities. CVTN (Convolution Visual Transformer Network) leverages sensor data represented jointly in the spatial and temporal dimensions to improve the resilience of the HAR process. The proposed technique employs a hybrid model that integrates Convolutional Neural Networks (CNNs) and Visual Transformers (VTs). First, the CNN component learns spatial visual features from diverse sensor data. These features are then fed into the transformer component of the model, which captures temporal dependencies by observing sensor states across different time points. The efficacy of CVTN is assessed on the Kinetics dataset, which emulates real-world human activity recognition scenarios. The experimental results show a clear improvement over recent baseline HAR solutions, confirming the method's potential for advancing activity analysis.
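
The abstract describes a two-stage pipeline: a CNN extracts spatial features per frame, and a transformer encoder then models the temporal dependencies across frames before classification. The sketch below (PyTorch) illustrates this CNN-then-transformer structure only; the class name, layer sizes, and the (batch, time, channels, height, width) input layout are assumptions for illustration and do not reproduce the authors' CVTN implementation.

# Minimal, illustrative sketch of a hybrid CNN + transformer for activity
# recognition. Hyperparameters and architecture details are assumed, not
# taken from the paper.
import torch
import torch.nn as nn


class HybridCNNTransformer(nn.Module):
    def __init__(self, num_classes=400, embed_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        # CNN component: extracts spatial features from each frame independently.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch*time, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)
        # Transformer component: models temporal dependencies across frames.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).flatten(1)  # (b*t, 64)
        tokens = self.proj(feats).view(b, t, -1)             # (b, t, embed_dim)
        temporal = self.transformer(tokens)                   # (b, t, embed_dim)
        return self.classifier(temporal.mean(dim=1))          # (b, num_classes)


# Example usage on a dummy clip: 2 videos, 8 frames each, 64x64 RGB.
if __name__ == "__main__":
    model = HybridCNNTransformer()
    clip = torch.randn(2, 8, 3, 64, 64)
    print(model(clip).shape)  # torch.Size([2, 400])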

Related Material


[bibtex]
@InProceedings{Djenouri_2023_ICCV,
    author    = {Djenouri, Youcef and Belbachir, Ahmed Nabil},
    title     = {A Hybrid Visual Transformer for Efficient Deep Human Activity Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {721-730}
}