Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition

Gnana Praveen, Eric Granger, Patrick Cardinal

FG 2021: IEEE Int’l Conf. on Automatic Face and Gesture Recognition, Jodhpur, India. 2021/12/15

Abstract: Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated unimodal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach to extract the salient features across A-V modalities, allowing for …