Estimating Multiple Emotion Descriptors by Separating Description and Inference
To describe complex emotional states, psychologists have proposed multiple emotion descriptors: sparse descriptors such as facial action units, continuous descriptors such as valence and arousal, and discrete class descriptors such as the expressions of happiness and anger. According to Cohn et al., facial action units are sign vehicles that convey the emotion message, while discrete or continuous emotion descriptors are the messages perceived by observers. The two differ in focus: sign vehicles describe facial behavior, whereas emotion messages capture an observer's inference about the subject's underlying state from that behavior. We describe a novel architecture for multiple emotion descriptor estimation that incorporates this prior knowledge about the difference between descriptive labels (sign vehicles, such as facial action units) and inferential labels (emotion messages, such as discrete emotion expressions, valence, and arousal). In our multi-level architecture, a common set of low-level features of facial regions is fed into two separate branches: one for descriptive labels and the other for inferential labels. The differences between these two branches reflect the differences between the two types of labels: sign vehicles are typically more specific and spatially localized, whereas emotion messages are reflected across the entire face. Our experiments on the ABAW3 challenge dataset demonstrate that this approach outperforms all other submitted approaches to multi-task learning. Code is available at https://github.com/HKUST-NISL/ABAW3_MultiEmotionNet.
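The branch separation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all dimensions, the AU-to-region mapping, and the weight matrices are hypothetical stand-ins. The key point it shows is the structural asymmetry: the descriptive branch reads only the region features relevant to each localized action unit, while the inferential branch pools features over the whole face before predicting an expression class and valence/arousal values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 7 facial regions, each
# described by a 32-d feature vector from a shared backbone.
N_REGIONS, FEAT_DIM = 7, 32
N_AUS, N_EXPRESSIONS = 12, 8

# Descriptive branch: action units are spatially localized, so each
# AU head reads only its own region's features (toy AU->region map).
W_au = rng.normal(size=(N_AUS, FEAT_DIM))
au_region = rng.integers(0, N_REGIONS, size=N_AUS)

def descriptive_branch(feats):
    # One logit per AU from that AU's assigned region.
    logits = np.einsum('af,af->a', W_au, feats[au_region])
    return 1.0 / (1.0 + np.exp(-logits))  # per-AU occurrence probability

# Inferential branch: emotion messages reflect the entire face, so
# pool over all regions before classification and VA regression.
W_expr = rng.normal(size=(N_EXPRESSIONS, FEAT_DIM))
W_va = rng.normal(size=(2, FEAT_DIM))

def inferential_branch(feats):
    pooled = feats.mean(axis=0)                 # global face representation
    expr_logits = W_expr @ pooled
    expr_probs = np.exp(expr_logits - expr_logits.max())
    expr_probs /= expr_probs.sum()              # softmax over expressions
    valence_arousal = np.tanh(W_va @ pooled)    # both constrained to [-1, 1]
    return expr_probs, valence_arousal

# Shared low-level features of facial regions feed both branches.
feats = rng.normal(size=(N_REGIONS, FEAT_DIM))
aus = descriptive_branch(feats)
expr, va = inferential_branch(feats)
print(aus.shape, expr.shape, va.shape)  # (12,) (8,) (2,)
```

In a trained model the two branches would be learned jointly on the shared backbone, with a binary loss per action unit, a cross-entropy loss on the expression class, and a regression loss on valence and arousal.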