A Cross-Domain Approach for Continuous Impression Recognition from Dyadic Audio-Visual-Physio Signals

Yuanchao Li and Catherine Lai
The impression we make on others depends not only on what we say, but also, to a large extent, on how we say it. As a sub-branch of affective computing and social signal processing, impression recognition has proven critical in both human-human conversations and spoken dialogue systems. However, most research has studied impressions only from the signals expressed by the emitter, ignoring the response from the receiver. In this paper, we perform impression recognition using a proposed cross…

Multimodal Dyadic Impression Recognition via Listener Adaptive Cross-Domain Fusion

This paper performs impression recognition using a proposed listener adaptive cross-domain architecture, which consists of a listener adaptation function to model the causality between speaker and listener behaviors and a cross-domain fusion function to strengthen their connection.



Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function

This work proposes a triplet framework based on a Long Short-Term Memory (LSTM) neural network for speech emotion recognition; it learns a mapping from acoustic features to discriminative embedding features, which serve as the basis for testing with an SVM.

Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning

An end-to-end (E2E) speech emotion recognition (SER) method that combines multitask learning with self-attention is proposed to deal with several issues; it outperforms the state-of-the-art methods and improves the overall accuracy.

Fusing ASR Outputs in Joint Training for Speech Emotion Recognition

  • Yuanchao Li, P. Bell, Catherine Lai
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
Experiments show that in joint ASR-SER training, incorporating both ASR hidden and text output using a hierarchical co-attention fusion approach improves the SER performance the most.

Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms

A new implementation of emotion recognition from the para-lingual information in the speech, based on a deep neural network, applied directly to spectrograms, achieves higher recognition accuracy compared to previously published results, while also limiting the latency.

Audio-Oriented Multimodal Machine Comprehension via Dynamic Inter- and Intra-modality Attention

A Dynamic Inter- and Intra-modality Attention (DIIA) model is proposed to effectively fuse the two modalities (audio and textual) in Audio-Oriented Multimodal Machine Comprehension, making fair comparisons possible between the model and the existing unimodal MC models.

Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features

A hierarchical fusion strategy for multimodal emotion recognition is proposed; it incorporates global or more abstract features at higher levels of its knowledge-inspired structure and consistently outperforms both feature-level and decision-level fusion.

Analyzing first impressions of warmth and competence from observable nonverbal cues in expert-novice interactions

The analysis of a corpus of dyadic expert-novice knowledge sharing interactions aims at investigating the relationship between observed non-verbal cues and first impressions formation of warmth and competence, and provides interesting insights about the role of rest poses.

A review of affective computing: From unimodal analysis to multimodal fusion

Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition

  • Bagus Tris Atmaja, M. Akagi
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper proposes two methods to predict emotional attributes from audio and visual data, using multitask learning and a fusion strategy; a multistage fusion is proposed to combine the final predictions of the various modalities.

The YouTube Lens: Crowdsourced Personality Impressions and Audiovisual Analysis of Vlogs

This work investigates the feasibility of crowdsourcing personality impressions from vlogging as a way to obtain judgements from a varied audience that consumes social media video, and addresses the task of automatically predicting vloggers' personality impressions using nonverbal cues and machine learning techniques.