Representation Learning Through Cross-Modal Conditional Teacher-Student Training For Speech Emotion Recognition
@inproceedings{Srinivasan2022RepresentationLT,
  title={Representation Learning Through Cross-Modal Conditional Teacher-Student Training For Speech Emotion Recognition},
  author={Sundararajan Srinivasan and Zhaocheng Huang and Katrin Kirchhoff},
  year={2022}
}
Generic pre-trained speech and text representations promise to reduce the need for large labeled datasets on specific speech and language tasks. However, it is not clear how to effectively adapt these representations for speech emotion recognition. Recent public benchmarks show the efficacy of several popular self-supervised speech representations for emotion classification. In this study, we show that the primary difference between the top-performing representations is in predicting valence…
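The abstract describes the idea only at a high level; a minimal sketch of cross-modal teacher-student training in that spirit follows, where a frozen text encoder supervises a speech encoder through an embedding-matching loss. All module names, dimensions, and the MSE distillation loss are illustrative assumptions, not the paper's exact conditional formulation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for pre-trained encoders: a frozen text model as
# teacher, a speech model as student. Dimensions are arbitrary.
class Encoder(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, x):
        return self.net(x)

teacher = Encoder(768, 128)            # text side, kept frozen
student = Encoder(512, 128)            # speech side, trained
for p in teacher.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
text_feats = torch.randn(8, 768)       # placeholder pooled text features
speech_feats = torch.randn(8, 512)     # placeholder pooled speech features

with torch.no_grad():                  # teacher provides the target embedding
    target = teacher(text_feats)
# Student learns to match the teacher's embedding space (MSE is an
# illustrative choice of distillation loss).
loss = nn.functional.mse_loss(student(speech_feats), target)
loss.backward()
opt.step()
```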
2 Citations
Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
- Computer Science, ArXiv
- 2022
These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
Dawn of the transformer era in speech emotion recognition: closing the valence gap
- Computer Science, ArXiv
- 2022
Transformer-based architectures constitute the new state of the art in SER, though further advances are needed to mitigate remaining robustness and individual-speaker issues; this work is the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers.
References
Showing 1-10 of 34 references
Using Large Pre-Trained Models with Cross-Modal Attention for Multi-Modal Emotion Recognition
- Computer Science, ArXiv
- 2021
This work proposes using large self-supervised pre-trained models for both the audio and text modalities, combined with cross-modal attention for multimodal emotion recognition, and achieves a 1.88% absolute improvement in accuracy over the previous state-of-the-art method on the IEMOCAP dataset.
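A minimal sketch of the cross-modal attention pattern this entry describes, with audio frames attending over text tokens; dimensions and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Audio frames as queries, text tokens as keys/values: each audio frame
# pools the lexical context most relevant to it (the reverse direction is
# symmetric).
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio = torch.randn(2, 100, 256)   # (batch, audio frames, dim)
text = torch.randn(2, 20, 256)     # (batch, text tokens, dim)

fused, weights = attn(query=audio, key=text, value=text)
print(fused.shape)                 # torch.Size([2, 100, 256])
```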
Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning
- Computer Science, APSIPA Transactions on Signal and Information Processing
- 2020
The findings suggest that MTL with two weighting parameters captures the interrelation of emotional attributes better than the other evaluated methods, across both categorical and dimensional emotion results and from both psychological and engineering perspectives.
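If the "two parameters" are read as loss weights balancing the dimensional attributes, a weighted multitask loss can be sketched as below; that interpretation and the alpha/beta values are assumptions, not the paper's exact parameterization.

```python
import torch

# Hypothetical predictions and targets for three emotional dimensions.
pred = {k: torch.randn(8) for k in ("valence", "arousal", "dominance")}
true = {k: torch.randn(8) for k in ("valence", "arousal", "dominance")}
mse = torch.nn.functional.mse_loss

# Two scalar weights (alpha, beta) trade the attributes off against the
# first, whose weight is fixed at 1: an MTL loss with two free parameters.
alpha, beta = 0.5, 0.3
loss = (mse(pred["valence"], true["valence"])
        + alpha * mse(pred["arousal"], true["arousal"])
        + beta * mse(pred["dominance"], true["dominance"]))
```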
Contrastive Unsupervised Learning for Speech Emotion Recognition
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
It is shown that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets, which improves emotion recognition performance, including the concordance correlation coefficient on IEMOCAP.
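A toy single-step sketch of the InfoNCE objective at the heart of CPC: a context vector scores a future latent against in-batch negatives. Shapes and the bilinear prediction head are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

batch, dim = 16, 128
context = torch.randn(batch, dim)               # c_t from an autoregressive model
future = torch.randn(batch, dim)                # z_{t+k} from the encoder
W = torch.randn(dim, dim, requires_grad=True)   # bilinear prediction head

# Similarity of each context against every future latent in the batch;
# the matching pair sits on the diagonal, the rest act as negatives.
scores = context @ W @ future.t()               # (batch, batch)
labels = torch.arange(batch)                    # positives on the diagonal
loss = F.cross_entropy(scores, labels)          # InfoNCE objective
loss.backward()
```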
Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network
- Computer Science, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
This paper proposes a solution to the problem of 'context-aware' emotion-relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.
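A minimal convolutional-recurrent model over raw waveform in the spirit of this entry; layer sizes and the four-class output are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvRNN(nn.Module):
    def __init__(self, n_emotions=4):
        super().__init__()
        self.conv = nn.Sequential(
            # First conv learns filterbank-like features from raw samples.
            nn.Conv1d(1, 64, kernel_size=80, stride=40), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, 64, batch_first=True)
        self.out = nn.Linear(64, n_emotions)

    def forward(self, wave):                  # wave: (batch, samples)
        h = self.conv(wave.unsqueeze(1))      # (batch, 128, frames)
        h, _ = self.lstm(h.transpose(1, 2))   # (batch, frames, 64)
        return self.out(h[:, -1])             # classify from last LSTM state

logits = ConvRNN()(torch.randn(2, 16000))     # 1 s of 16 kHz audio
print(logits.shape)                           # torch.Size([2, 4])
```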
Leveraging Pre-trained Language Model for Speech Sentiment Analysis
- Computer Science, Interspeech
- 2021
This paper investigates the use of pre-trained language models to learn sentiment information from written texts for speech sentiment analysis, and proposes a pseudo-label-based semi-supervised training strategy that uses a language model within an end-to-end speech sentiment approach to take advantage of a large but unlabeled speech dataset for training.
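A hedged sketch of the pseudo-labeling step such a strategy relies on: an existing classifier labels unlabeled data, and only confident predictions are kept as extra training examples. The toy model, batch format, and 0.9 threshold are assumptions.

```python
import torch
import torch.nn as nn

def pseudo_label(model, unlabeled_batches, threshold=0.9):
    pseudo = []
    model.eval()
    with torch.no_grad():
        for x in unlabeled_batches:
            probs = torch.softmax(model(x), dim=-1)
            conf, label = probs.max(dim=-1)
            keep = conf >= threshold              # keep confident predictions only
            pseudo.extend(zip(x[keep], label[keep]))
    return pseudo  # merged with gold-labeled data for the next training round

model = nn.Linear(10, 3)                          # toy sentiment classifier
batches = [torch.randn(32, 10) for _ in range(4)] # unlabeled features
extra = pseudo_label(model, batches)
```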
An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks
- Computer Science, INTERSPEECH
- 2020
A novel data-processing approach extracts a fixed number of small chunks from sentences of different durations by changing the overlap between these chunks, providing an ideal framework for combining gated-network or attention mechanisms with long short-term memory (LSTM) networks.
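The chunking idea can be made concrete: fix the chunk count and length, and let the stride (hence the overlap) vary with sentence duration. The sketch below assumes frame-level features and illustrative parameter values.

```python
import numpy as np

def fixed_chunks(frames, num_chunks=11, chunk_len=50):
    """Split a (T, feat) sequence into a fixed number of equal-length
    chunks; longer sentences get a larger step (less overlap), shorter
    ones more overlap. Parameter values are assumptions."""
    T = len(frames)
    if T < chunk_len:  # pad short sentences up to one chunk length
        frames = np.pad(frames, ((0, chunk_len - T), (0, 0)))
        T = chunk_len
    step = (T - chunk_len) / max(num_chunks - 1, 1)
    starts = [int(round(i * step)) for i in range(num_chunks)]
    return np.stack([frames[s:s + chunk_len] for s in starts])

chunks = fixed_chunks(np.random.randn(400, 40))
print(chunks.shape)  # (11, 50, 40)
```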
Exploiting Acoustic and Lexical Properties of Phonemes to Recognize Valence from Speech
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper investigates how to jointly consider acoustic and lexical factors to improve the prediction of emotional valence, examines the relationship between improved prediction and the emotion elicitation process, and presents a network that exploits both the acoustic and the lexical properties of phonetic information using multi-stage fusion.
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
- Computer Science, NAACL
- 2021
A novel semi-supervised learning framework, SPLAT, jointly pre-trains the speech and language modules and improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
Layer-Wise Analysis of a Self-Supervised Speech Representation Model
- Computer Science, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
This work examines one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools to characterize the evolution of information across model layers and to understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations.
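A minimal sketch of how such layer-wise representations can be pulled from wav2vec 2.0, here via the Hugging Face transformers package; the checkpoint name and one-second dummy input are illustrative, and the paper's own extraction pipeline may differ.

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

wave = torch.randn(1, 16000)  # placeholder for 1 s of 16 kHz audio
with torch.no_grad():
    out = model(wave, output_hidden_states=True)

# hidden_states holds the CNN-extractor output plus one entry per
# transformer layer, each (batch, frames, dim): the objects a layer-wise
# analysis probes.
for i, h in enumerate(out.hidden_states):
    print(i, h.shape)
```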
Discriminatively Trained Recurrent Neural Networks for Continuous Dimensional Emotion Recognition from Audio
- Computer Science, IJCAI
- 2016
This paper introduces a technique for the discriminative training of deep neural networks using the concordance correlation coefficient as the cost function, which unites both correlation and mean squared error in a single differentiable function.
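Lin's concordance correlation coefficient has a simple differentiable form, CCC = 2 cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), so a CCC-based cost is straightforward to sketch; the snippet below is a generic PyTorch rendering, not the paper's implementation.

```python
import torch

def ccc_loss(pred, target):
    """1 - concordance correlation coefficient: a differentiable objective
    uniting correlation and mean squared error."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    ccc = 2 * cov / (pv + tv + (pm - tm) ** 2)
    return 1 - ccc

loss = ccc_loss(torch.randn(100, requires_grad=True), torch.randn(100))
loss.backward()
```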