Representation Learning Through Cross-Modal Conditional Teacher-Student Training For Speech Emotion Recognition

Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff
Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show that the primary difference between the top-performing representations is in predicting valence… 


Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
Dawn of the transformer era in speech emotion recognition: closing the valence gap
Transformer-based architectures constitute the new state-of-the-art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. This work is the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers.
Comparing supervised and self-supervised embedding for ExVo Multi-Task learning track
The studies show that the best performance is obtained with a hybrid approach, where predictions derived via both SSL and task-specific supervised learning are used, and that the best system on the test set surpasses the ComParE baseline.


Using Large Pre-Trained Models with Cross-Modal Attention for Multi-Modal Emotion Recognition
This work proposes using large self-supervised pre-trained models for both the audio and text modalities, with cross-modality attention, for multimodal emotion recognition, achieving a 1.88% absolute improvement in accuracy over the previous state-of-the-art method on the IEMOCAP dataset.
Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning
The findings suggest that the use of MTL with two parameters is better than other evaluated methods in representing the interrelation of emotional attributes in representation of categorical and dimensional emotion results from psychological and engineering perspectives.
Contrastive Unsupervised Learning for Speech Emotion Recognition
  • Mao Li, Bo Yang, Chao Wang
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
It is shown that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets, which improves emotion recognition performance, including concordance correlation coefficient performance on IEMOCAP.
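The concordance correlation coefficient (CCC) mentioned above is a standard agreement measure for dimensional emotion labels. As a minimal sketch, assuming population (1/n) statistics and plain Python lists, it can be computed as follows; the function name `concordance_cc` is an illustrative choice and is not taken from the paper.

```python
def concordance_cc(pred, gold):
    """Lin's concordance correlation coefficient for two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    Population (1/n) variance and covariance are assumed here.
    """
    n = len(pred)
    if n != len(gold) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and equal; CCC is undefined in this case.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC penalizes a shift in mean or scale between predictions and labels, so a prediction that is perfectly correlated with the labels but offset from them scores below 1.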
Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network
This paper proposes a solution to the problem of 'context-aware' emotionally relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.
Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
This work proposes a way to combine the output of several layers from the pre-trained model, producing richer speech representations than the model’s output alone, and shows that the best-performing models have better average recall than previous approaches that use deep neural networks trained on spectrograms and waveforms or shallow neural networks trained on features extracted from wav2vec 1.0.
Leveraging Pre-trained Language Model for Speech Sentiment Analysis
This paper investigates the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis and proposes a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach to take advantage of a large, but unlabeled speech dataset for training.
An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks
A novel data processing approach that extracts a fixed number of small chunks over sentences of different durations by changing the overlap between these chunks, providing an ideal framework to combine gated network or attention mechanisms with long short-term memory (LSTM) networks.
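The idea of covering sentences of different durations with a fixed number of equal-length chunks can be illustrated with a short sketch. It assumes frame-level features stored in a Python list, a sentence at least as long as one chunk, and evenly spaced chunk starts computed with integer arithmetic. The names `chunk_starts` and `split_into_chunks` are hypothetical, and the paper's exact procedure may differ.

```python
def chunk_starts(num_frames, chunk_len, num_chunks):
    """Start indices of num_chunks windows of chunk_len frames spanning the sentence.

    The step between starts shrinks for short sentences (more overlap) and
    grows for long ones (less overlap, possibly gaps).
    """
    if chunk_len <= 0 or num_chunks <= 0:
        raise ValueError("chunk_len and num_chunks must be positive")
    if num_frames < chunk_len:
        raise ValueError("sentence is shorter than one chunk (padding not handled here)")
    if num_chunks == 1:
        return [0]
    span = num_frames - chunk_len  # last valid start index
    return [(i * span) // (num_chunks - 1) for i in range(num_chunks)]


def split_into_chunks(frames, chunk_len, num_chunks):
    """Slice a frame sequence into a fixed number of equal-length chunks."""
    starts = chunk_starts(len(frames), chunk_len, num_chunks)
    return [frames[s:s + chunk_len] for s in starts]
```

With this scheme a 6-frame and a 10-frame sentence both yield three 4-frame chunks; only the overlap between neighbouring chunks changes, so a downstream model always sees the same input shape.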
Exploiting Acoustic and Lexical Properties of Phonemes to Recognize Valence from Speech
This paper investigates how to jointly consider both factors to improve the prediction of emotional valence, and the relationship between improved prediction and the emotion elicitation process, and presents a network that exploits both the acoustic and the lexical properties of phonetic information using multi-stage fusion.
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
A novel semi-supervised learning framework, SPLAT, that jointly pre-trains the speech and language modules and improves on the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
Layer-Wise Analysis of a Self-Supervised Speech Representation Model
This work examines one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools to characterize the evolution of information across model layers, and understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations.