Self-attention fusion for audiovisual emotion recognition with incomplete data

Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj
2022 26th International Conference on Pattern Recognition (ICPR)

In this paper, we consider the problem of multi-modal data analysis with a use case of audiovisual emotion recognition. We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms. While most previous works consider the ideal scenario in which both modalities are present at all times during inference, we evaluate the robustness of the model in unconstrained settings where one modality is absent or noisy, and…

Continual Transformers: Redundancy-Free Attention for Online Inference

Novel formulations of the Scaled Dot-Product Attention are proposed, which enable Transformers to perform efficient online token-by-token inference on a continual input stream and reduce the floating point operations per prediction by up to 63× and 2.6×, respectively, while retaining predictive performance.



Multimodal Transformer Fusion for Continuous Emotion Recognition

The Transformer model is used to fuse audio-visual modalities at the model level to improve emotion recognition performance, and model-level fusion is shown to outperform other fusion strategies.

Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks

In this work, we propose a new approach for multimodal emotion recognition using cross-modal attention and raw waveform based convolutional neural networks. Our approach uses audio and text

Audiovisual SlowFast Networks for Video Recognition

This work reports state-of-the-art results on six video action classification and detection datasets, performs detailed ablation studies, and shows the generalization of AVSlowFast to learn self-supervised audiovisual features.

Speech Emotion Recognition Using Deep Learning Techniques: A Review

An overview of Deep Learning techniques is presented and some recent literature where these methods are utilized for speech-based emotion recognition is discussed, including databases used, emotions extracted, contributions made toward speech emotion recognition and limitations related to it.

Multimodal Transformer for Unaligned Multimodal Language Sequences

Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals.

Robust Lightweight Facial Expression Recognition Network with Label Distribution Training

This paper presents an efficient and robust facial expression recognition (FER) network, named EfficientFace, which has far fewer parameters yet is more robust to FER in the wild, and introduces a simple but efficient label distribution learning (LDL) method as a novel training strategy.

MMTM: Multimodal Transfer Module for CNN Fusion

A simple neural network module for leveraging the knowledge from multiple modalities in convolutional neural networks, named Multimodal Transfer Module (MMTM), which improves the recognition accuracy of well-known multimodal networks.

Learning to ignore: rethinking attention in CNNs

This work proposes to reformulate the attention mechanism in CNNs to learn to ignore instead of learning to attend, and proves that learning to ignore, i.e., implicit attention, yields superior performance compared to the standard approaches.

Deep Multimodal Fusion by Channel Exchanging

Channel-Exchanging-Network is proposed, a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities. The exchange is self-guided by individual channel importance, measured by the magnitude of the Batch-Normalization (BN) scaling factor during training.

Automatic social signal analysis: Facial expression recognition using difference convolution neural network