Multimodal Interpretable Depression Analysis using Visual, Physiological, Audio and Textual Data

Puneet Kumar, Shreshtha Misra, Zhuhong Shao, Bin Zhu, Balasubramanian Raman, Xiaobai Li; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5305-5315

Abstract


Motivated by depression's significant impact on global health, this work proposes MultiDepNet, a novel multimodal, interpretable depression detection system integrating visual, physiological, audio, and textual data. Through dedicated feature extraction methods (MTCNN for video, TS-CAN for physiological signals, ResNet-18 for audio, and RoBERTa for text) and a strategic fusion of modality-specific networks including CNN-RNN, Transformer, MLP, and ResNet-18, it achieves significant advancements in depression detection. Its performance, evaluated across four benchmark datasets (AVEC 2013, AVEC 2014, DAIC, and E-DAIC), demonstrates an average MAE of 5.64, RMSE of 7.15, accuracy of 74.19%, precision of 0.7373, recall of 0.7378, and F1 of 0.7376. It also implements a MultiViz-based interpretability mechanism that computes each modality's contribution to the model's performance. The results reveal the visual modality to be the most significant, contributing 37.88% towards depression detection.
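The paper's implementation is not reproduced here. As a rough illustration of the fusion stage described in the abstract, the following Python (PyTorch) sketch concatenates pre-extracted per-modality embeddings and regresses a depression score. All class names, embedding dimensions, and layer sizes are illustrative assumptions rather than the authors' actual MultiDepNet code; the upstream extractors (MTCNN, TS-CAN, ResNet-18, RoBERTa) are assumed to produce fixed-size feature vectors per sample.

    # Hypothetical late-fusion sketch; not the authors' released code.
    import torch
    import torch.nn as nn

    class ModalityBranch(nn.Module):
        """Projects one modality's embedding into a shared hidden space."""
        def __init__(self, in_dim: int, hidden_dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    class LateFusionDepressionModel(nn.Module):
        """Concatenates per-modality projections and predicts a depression score."""
        def __init__(self, dims: dict, hidden_dim: int = 128):
            super().__init__()
            self.branches = nn.ModuleDict(
                {name: ModalityBranch(d, hidden_dim) for name, d in dims.items()}
            )
            self.head = nn.Sequential(
                nn.Linear(hidden_dim * len(dims), hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),  # scalar severity score (e.g. BDI-II / PHQ-8 style)
            )

        def forward(self, feats: dict) -> torch.Tensor:
            # Fuse modalities in a fixed order defined by the branch dictionary.
            fused = torch.cat(
                [branch(feats[name]) for name, branch in self.branches.items()],
                dim=-1,
            )
            return self.head(fused)

    if __name__ == "__main__":
        # Illustrative embedding sizes for the four modalities (assumed values).
        dims = {"visual": 512, "physio": 64, "audio": 512, "text": 768}
        model = LateFusionDepressionModel(dims)
        batch = {k: torch.randn(4, d) for k, d in dims.items()}
        print(model(batch).shape)  # torch.Size([4, 1])

Keeping each modality in its own branch, as in this sketch, makes per-modality attribution (the kind of contribution analysis the paper performs with MultiViz) straightforward, since individual branches can be ablated or probed independently.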

Related Material


[pdf]
[bibtex]
@InProceedings{Kumar_2025_WACV,
  author    = {Kumar, Puneet and Misra, Shreshtha and Shao, Zhuhong and Zhu, Bin and Raman, Balasubramanian and Li, Xiaobai},
  title     = {Multimodal Interpretable Depression Analysis using Visual, Physiological, Audio and Textual Data},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {5305-5315}
}