Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition

@article{Zhou2022MultiClassifierIL,
  title={Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition},
  author={Ying Zhou and Xuefeng Liang and Yu Gu and Yifei Yin and Longshan Yao},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2022},
  volume={30},
  pages={695--705}
}
In recent years, speech emotion recognition technology has become significant in widespread applications such as call centers, social robots, and health care. As a result, speech emotion recognition has attracted much attention in both industry and academia. Since the emotions present in an entire utterance may occur with varied probabilities, speech emotion is often ambiguous, which poses great challenges to recognition tasks. However, previous studies commonly assigned a single label or… 

Using Crowdsourcing to Train Facial Emotion Machine Learning Models with Ambiguous Labels

TLDR
This work replaces traditional one-hot encoded label representations with a crowd's distribution of labels and demonstrates that the consensus labels from the crowd tend to match the consensus from the original CAFE raters, validating the utility of crowdsourcing.
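The core idea above, replacing a one-hot label with the crowd's distribution of votes, can be sketched as follows. This is an illustrative example only: the `soft_labels` helper and the emotion class set are hypothetical, not taken from the paper.

```python
from collections import Counter

def soft_labels(votes, classes):
    """Convert a list of crowd-worker votes into a probability
    distribution over emotion classes (a 'soft' label)."""
    counts = Counter(votes)
    total = len(votes)
    return [counts.get(c, 0) / total for c in classes]

classes = ["happy", "sad", "angry", "neutral"]
# Five hypothetical raters disagree on an ambiguous sample:
votes = ["happy", "happy", "neutral", "happy", "sad"]
print(soft_labels(votes, classes))  # [0.6, 0.2, 0.0, 0.2]
```

The argmax of the resulting distribution is the consensus label, while the full distribution preserves the inter-rater disagreement that a one-hot encoding would discard.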

Training Affective Computer Vision Models by Crowdsourcing Soft-Target Labels

TLDR
Crowdsourcing, with a sufficient filtering mechanism for selecting reliable crowd workers, is shown to be a feasible solution for acquiring soft-target labels, and an emotion detection classifier trained with these labels is evaluated.

Self-labeling with feature transfer for speech emotion recognition

References

SHOWING 1-10 OF 34 REFERENCES

Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition

TLDR
A two-channel SER system (HSF-CRNN) is developed to jointly learn emotion-related features with better discriminative properties; considering that the time duration of a speech segment significantly affects the accuracy of emotion recognition, another two-channel SER system is proposed in which CRNN features extracted from spectrogram segments at different time scales are used for joint representation learning.

Soft-Target Training with Ambiguous Emotional Utterances for DNN-Based Speech Emotion Classification

TLDR
This paper modifies the soft-target training in order to effectively handle both clear and ambiguous emotional utterances, and yields performance improvements in terms of both weighted and unweighted accuracies.
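Soft-target training as summarized above amounts to computing cross-entropy against a label distribution rather than a one-hot vector; the same loss then handles both clear utterances (a near-one-hot target) and ambiguous ones. A minimal sketch, assuming NumPy and illustrative logits/targets not drawn from the paper:

```python
import numpy as np

def soft_target_ce(logits, target_dist):
    """Cross-entropy between the predicted distribution (softmax of
    the logits) and a soft target distribution over emotion classes."""
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-(target_dist * log_probs).sum())

logits = np.array([2.0, 0.5, 0.1])
one_hot = np.array([1.0, 0.0, 0.0])   # a 'clear' utterance
soft = np.array([0.7, 0.2, 0.1])      # an ambiguous utterance
print(soft_target_ce(logits, one_hot))
print(soft_target_ce(logits, soft))
```

With a one-hot target this reduces to the usual softmax cross-entropy, so the same training loop serves both kinds of labels.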

Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network

TLDR
This paper proposes a solution to the problem of 'context-aware', emotion-relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.

Predicting Categorical Emotions by Jointly Learning Primary and Secondary Emotions through Multitask Learning

TLDR
This work takes advantage of both types of annotations to improve the performance of emotion classification and shows that considering secondary emotion labels during the learning process leads to relative improvements of 7.9% in F1-score for an 8-class emotion classification task.

Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition

TLDR
A novel approach to learning discriminative features from variable-length spectrograms for emotion recognition by combining softmax cross-entropy loss and center loss, which leads the network to learn more effective features for emotion recognition.
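The joint objective described above is the cross-entropy term plus a weighted center-loss term that pulls each deep feature toward its class center. A rough sketch, assuming NumPy; the function names, feature dimensionality, and weighting value are illustrative, not from the paper:

```python
import numpy as np

def center_loss(features, labels, centers):
    """Mean squared distance between each deep feature and the
    center of its class, encouraging intra-class compactness."""
    diffs = features - centers[labels]
    return float(0.5 * (diffs ** 2).sum() / len(features))

def joint_loss(ce_loss, features, labels, centers, lam=0.01):
    """Total objective: softmax cross-entropy plus weighted center loss."""
    return ce_loss + lam * center_loss(features, labels, centers)

features = np.array([[1.0, 0.0], [0.0, 0.8]])  # deep features of a batch
labels = np.array([0, 1])                      # their emotion classes
centers = np.array([[1.0, 0.0], [0.0, 1.0]])   # current class centers
print(joint_loss(2.0, features, labels, centers))
```

In practice the centers are updated alongside the network parameters during training, and `lam` trades off discriminative separation against intra-class compactness.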

Speech Emotion Recognition Using Capsule Networks

  • Xixin Wu, Songxiang Liu, H. Meng
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This paper presents a novel architecture based on the capsule networks (CapsNets) for SER that can take into account the spatial relationship of speech features in spectrograms, and provide an effective pooling method for obtaining utterance global features.

Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels

TLDR
This paper proposes two viable approaches to model the subjectiveness of emotions by incorporating inter-annotator variability, which are soft labels and model ensembling, where each model represents an annotator, and demonstrates that both approaches lead to consistent improvement over using ground truth labels.

An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech

TLDR
Key results indicate that deep-spectrum features are comparable in performance with the other tested acoustic feature representations in matched noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.

A multiple perception model on emotional speech

  • J. Tao, Ai-jun Li, Shifeng Pan
  • Psychology
    2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops
  • 2009
TLDR
A statistical model is created to simulate emotion perception and it is found that there is an underlying consistency in the patterns of responses.

Speech Emotion Recognition Using Voiced Segment Selection Algorithm

TLDR
A new algorithm, the Voiced Segment Selection (VSS) algorithm, which can produce an accurate segmentation of speech signals and has the potential to improve the performance of emotion recognition from speech.