Robust Latent Representations Via Cross-Modal Translation and Alignment

  • Vandana Rajan, Alessio Brutti, Andrea Cavallaro
  • Published 3 November 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available for testing. This is a limitation when signals from some modalities are unavailable or severely degraded. To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only… 

Cross-Modal Knowledge Transfer via Inter-Modal Translation and Alignment for Affect Recognition
This work aims to improve the performance of uni-modal affect recognition models by transferring knowledge from a better-performing (or stronger) modality to a weaker modality during training, and validates the approach on two multi-modal affect datasets, namely CMU-MOSI for binary sentiment classification and RECOLA for dimensional emotion regression.
Gaze-enhanced Crossmodal Embeddings for Emotion Recognition
This work proposes a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework and shows that this method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset.


Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
This paper proposes a method to learn robust joint representations by translating between modalities based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input.
Audio-Visual Fusion for Sentiment Classification using Cross-Modal Autoencoder
This paper proposes a novel model combining deep canonical correlation analysis (DCCA) with cross-modal autoencoders that tries to reconstruct the representations corresponding to the missing modality, using the DCCA transformed representations of the available input modalities.
Dense Multimodal Fusion for Hierarchically Joint Representation
  • Di Hu, F. Nie, Xuelong Li
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This paper proposes Dense Multimodal Fusion (DMF), which densely integrates representations by greedily stacking multiple shared layers between different modality-specific networks, resulting in faster convergence, lower training loss, and better performance.
EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings
The results show that the proposed framework significantly outperforms related baselines in monomodal inference and is competitive with or superior to recently reported systems, which emphasises the importance of the proposed crossmodal learning for emotion recognition.
Cross and Learn: Cross-Modal Self-Supervision
In this paper we present a self-supervised method for representation learning utilizing two different modalities. Based on the observation that cross-modal information has a high semantic meaning we
Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training
This work presents an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition, and introduces a "spatiotemporal semantic alignment" loss (SSA) to align the content of the features from different networks.
Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis
This paper learns multi-modal embeddings from text, audio, and video views/modes of data in order to improve downstream sentiment classification, and posits that this highly optimized algorithm dominates over the contribution of other views, though each view does contribute to the final result.
M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues
This work presents M3ER, a learning-based method for emotion recognition from multiple input modalities that combines cues from multiple co-occurring modalities and is more robust than other methods to sensor noise in any of the individual modalities.
Learning with Privileged Information via Adversarial Discriminative Modality Distillation
A new approach to train a hallucination network that learns to distill depth information via adversarial learning is proposed, resulting in a clean approach without several losses to balance or hyperparameters.
End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
This work proposes an emotion recognition system using auditory and visual modalities: a convolutional neural network extracts features from the speech, while a deep residual network of 50 layers is used for the visual modality.